Aspect - Probabilistic document clustering

Clustering documents and finding similar documents based on cluster memberships.

Institution: Slovak University of Technology
Technologies used: R statistical language, Java
Inputs: indexed documents and the selected document
Outputs: list of the documents most similar to the selected document

Addressed Problems

Finding relevant information in large document collections is a pressing problem, especially on the Internet. Once a user finds a relevant piece of information, it is convenient to be able to reach similar items easily. Searching for similar information is a common task when looking for a suitable job offer. Organizing documents that contain job offers into clusters addresses this need: similar documents (documents containing similar words) then belong to the same cluster.

Description

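The aspect model is a probabilistic model for soft clustering: each document belongs to every cluster with some probability rather than to exactly one cluster. It is used in the NAZOU project to cluster documents. Based on the cluster memberships (expressed as probabilities), it finds the documents most similar to a selected document.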
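As an illustration of how similar documents could be found from cluster memberships, the following minimal Java sketch ranks documents by how closely their membership vectors match that of the selected document. The class name MembershipSimilarity, the method names, and the choice of cosine similarity are assumptions made for this example; the source does not say which similarity measure the NAZOU tool actually uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of the original tool): ranks documents by membership similarity. */
final class MembershipSimilarity {

    /**
     * Cosine similarity between two membership vectors of equal length.
     * Cosine is an assumed choice; other measures would fit the same interface.
     */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int z = 0; z < a.length; z++) {
            dot += a[z] * b[z];
            normA += a[z] * a[z];
            normB += b[z] * b[z];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a vector with no membership mass is treated as similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Returns the indices of the topK documents most similar to the selected one,
     * best match first. memberships[d][z] is the probability that document d
     * belongs to cluster z.
     */
    static List<Integer> mostSimilar(double[][] memberships, int selected, int topK) {
        List<Integer> candidates = new ArrayList<>();
        for (int d = 0; d < memberships.length; d++) {
            if (d != selected) {
                candidates.add(d); // the selected document is not its own neighbour
            }
        }
        double[] query = memberships[selected];
        // Sort by descending similarity to the selected document.
        candidates.sort(Comparator.comparingDouble(
                (Integer d) -> cosine(query, memberships[d])).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(topK, candidates.size())));
    }
}
```

Calling MembershipSimilarity.mostSimilar(memberships, selected, 10) would return up to ten document indices ordered from most to least similar. Because the memberships are probabilities, two documents score highly only when they place their probability mass on the same clusters.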

It consists of two parts:

Schema of the aspect model: w stands for words, d for documents, and z for clusters.

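Assuming the schema corresponds to the standard formulation of the aspect model (also known as probabilistic latent semantic analysis), the relationship between w, d, and z can be written as follows. The first expression gives the joint probability of a document and a word as a mixture over clusters; the second gives the cluster membership of a document, obtained by Bayes' rule. The exact parameterization used in the NAZOU implementation is not stated in the source and may differ.

```latex
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
\qquad\text{and}\qquad
P(z \mid d) = \frac{P(z)\, P(d \mid z)}{\sum_{z'} P(z')\, P(d \mid z')}
```

Under this reading, the membership probabilities P(z | d) are the values used to compare documents.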