A Clusterer – Method for Graph Clustering
A.1 Basic Information
When the information space consists of objects and relations between these objects (e.g. job-offers and companies which published those offers), graph representation can serve as the basis for easier user navigation in the information space.
Unfortunately, raw graph presentation can be actually quite overwhelming for the user, hence additional structure needs to be created, allowing the user to select him the best-fitting level of granularity. In such cases, clustering is used to partition the dataset into subsets (clusters) of similar objects, supporting easier navigation in the data-set.
A.1.1 Basic Terms
Cluster |
Group of objects sharing common features |
A.1.2 Method Description
Clusterer composes hierarchy of layers of clusters:
§ Subgraph extraction from ontological repository to relational DB;
§ Hierarchic clustering based on selected clustering method;
§ Graph persistence into relational DB;
§ Communication with ClusterNavigator tool, which visualizes and supports user navigation in the hierarchy of clusters.
Example of a hierarchic composition of clusters is in Fig. 1.
Fig. 1. Example of hierarchic clustering.
A.1.3 Scenarios of Use
Clusterer should be used in following conditions:
§ Graph to be clustered is large and its vertices are well-connected, the edges can be either weighted or unweighted;
Typical scenario of use is present in Fig. 2. Note that Clusterer tool is a stand-alone application in the process of graph extraction and clustering, thus ClusterNavigator tool must be present only in the visualization phase.
Fig. 2. Architecture of the clustering tools – from OWL through graph clustering to user navigation.
A.1.4 External Links and Publications
§ Suchal J., Vojtek P., Frivolt G.: Interactive Navigation in Large Graphs based on Clustering. In Workshop of Tools for Acquisition, Organization and Maintenance of Knowledge in an Environment of Heterogeneous Information Resources, Poľana, Slovakia, 2007.
§ Frivolt G., Suchal J., Veselý R., Vojtek P., Vozár O., Bieliková M.: Creation, Population and Preprocessing of Experimental Data Sets for Evaluation of Applications for the Semantic Web. In SOFSEM’08, Nový Smokovec, Slovakia. LNCS, 2008.
§ JUNG: Java Universal Network/Graph Framework (http://jung.sourceforge.net/)
§ Sesame: RDF Schema Querying and Storage (http://www.openrdf.org/)
A.2 Integration Manual
Clusterer is developed in Java (Standard Edition 6).
A.2.1 Dependencies
Clusterer uses MySQL database. Visualization of the structure of clusters is dependent on ClusterNavigator tool.
A.2.2 Installation
Running Clusterer as server requires following steps:
A.2.3 Configuration
Edit clusterer.properties file:
A.2.4 Integration Guide
Clusterer can be used in a following way:
§ Extract subgraph of instances and relations between them from OWL.
§ Create cluster hierarchy;
§ Provide access to clusters stored in relational database (in the visualization and navigation in clusters).
Typical scenario of use of the tool Clusterer is following:
The process of graph extraction from ontology uses Sesame in order to access the OWL file. Nodes and edges to be extracted are defined in an XML file. An example of a configuration, where instances of types JobOffer, Settlement and ProfessionClassification produce graph nodes and predicate hasDutyLocation creates edges between them is following:
<nodes>
<node name="n1" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/offer-job#JobOffer" label="ofr:name"/>
<node name="n2" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/region#Settlement" label="rdfs:label"/>
<node name="n3" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/classification#ProfessionClassification" label="rdfs:label"/>
</nodes>
<edges>
<edge srcNode="n1" predicateConstraint="jo:hasDutyLocation" destNode="n2" />
</edges>
An extracted graph is stored in a relational database in two tables:
§ Table nodes stores graph vertices and has following attributes:
o id – unique node ID;
o type – type of node;
o text – text label of node;
o uri –instance URI in ontology.
§ Table edges stores edges between nodes. Following attributes are present:
o start_id – ID of starting node;
o end_id – ID of ending node;
o weight – edge weight.
Result of the hierarchic clustering is stored in the same relational database where the tables nodes and edges are stored in. Following tables are used:
§ Table clusters wraps unifies IDs of nodes and cluster nodes:
o Attribute ID is unique ID;
o Attribute node_id is from table nodes.id, if the lowest level node are wrapped, otherwise NULL value is used to determine cluster node.
o Attribute text is textual annotation of a node.
§ Table cluster_hierarchy stores the hierarchic composition of clusters where attributes cluster_id and parent_id are bound together with following meaning: cluster with parent_id contains vertex cluster_id (which can be a single node or cluster). Note that parent_id contains one but usually more cluster_id’s, but cluster_id is only in one parent_id.
§ Table cluster_closure maps clusters to lowest level nodes (IDs of the nodes are in the table clusters) – see Fig. 3 (c).
Fig. 3. Different approaches to cluster hierarchy representation.
A.3 Development Manual
A.3.1 Tool Structure
Clusterer consists of following packages:
§ Clusterer runner (sk.fiit.nazou.clusterer)
§ Database connection (sk.fiit.nazou.common)
§ Graph extraction from OWL (sk.fiit.nazou.extraction)
§ Clustering methods (sk.fiit.nazou.methods)
§ Communication with ClusterNavigator (sk.fiit.nazou.server)
§ Persistence of clusters using relational DB (sk.fiit.nazou.storage)
A.3.2 Method Implementation
The process of hierarchic clustering in the ClustererRunner class is following:
Each clustering method must implement GraphClusterer interface (according to JUNG library) – this step usually consist only of implementation of the extract method, which has following declaration:
public ClusterSet extract(ArchetypeGraph graph)
Please refer to the source code of DummyGraphClusterer for a simple example of a clustering method.
A.4 Manual for Adaptation to Other Domains
Clusterer is capable to clusterize any input graph. According to the qualities of the input graph (e.g., number of vertices, graph connectivity degree, weighted/unweighted graph), adequate clustering method should be selected.
A.4.1 Configuring to Other Domain
Switching to other domain usually requires:
§ Reconfiguration of the graph extractor;
§ Determining proper clustering method and appropriate parameters of the method;
§ If different database than MySQL is used, database connector must be changed and some SQL statements may be incorrect.
A.4.2 Dependencies
Clusterer depends on MySQL. Ontology to graph extractor depends on Sesame, graph visualization uses ClusterNavigator.