A              Top-k aggregator – Tool for finding user dependent top-k offers

A.1          Basic Information

Searching the best objects for users can be realized using complex user preferences. Top-k aggregator uses own indexes to optimize the search. The main advantages of the search are small number of disk accesses to find top-k objects, fast response time and support of complex user preferences.

In this tool we can use as input two forms of user preferences. First form consists of user preference to attribute values and monotone aggregation function to combine importance of object attributes. The second form uses fuzzy rules instead of aggregation function. The rules are usually a product of inductive learning made by tool IGAP. User preferences are usually acquired from user by tool UPreA.

A.1.1      Basic Terms

Local preference

Relates to one attribute, e.g. salary. It specifies which attribute values are preferred by user.

Global preference

Relates to all attributes. It provides a way to combine relevancies of objects obtained by user local preferences and get overall relevance of object.

Fuzzy set /

Fuzzy function

Local preference that maps every attribute value to a number from [0,1]. Higher number means higher relevance of this value. Such mapping can be viewed as a membership function of fuzzy set.

Aggregation function

Global preference, usually a weighted average. Aggregation function must be monotone in all its arguments.

Fuzzy rules

Another type of global preference. If object fulfills expressions on the right side of the rule (body), then the overall value of this object is at least the value on the left side of the rule (head).

B+ tree

Index structure used to organize orderable attribute e.g. salary, date, manager level

Dis-index

Index structure used to organize nominal (crisp) attribute like language, profession, industry

M-tree

Index structure used to organize metric attribute

Attribute source

Structure in which an attribute is stored. Usually an attribute source is an index structure, text file or a database table.

Object

Abstract term to something with attributes. In NAZOU project the term object can be substituted with a “job offer”

Sorted access

Access to an index structure to acquire next best object relatively to a local preference of a user

Random access

Access to an index structure to acquire attribute value for concrete object

A.1.2      Method Description

Top-k aggregator is a tool to search top-k objects using the algorithms from the family of Threshold algorithms proposed by Ronald Fagin. Having number of extensions we can use several forms of data storage, number of data sources and search algorithms. Main properties of the method are:

§  Each attribute is stored (indexed) separately in its own structure (attribute source).

§  Search algorithm does not access the whole attribute sources, just a part of each source. Accesses to attribute sources can be of two types only: sorted access and random access.

§  Index structures allow sorted access according to user local preferences without any reordering

Work with top-k aggregator can be divided to off-line and on-line phase. In the offline phase, the index creation is executed. In Top-k aggregator there is a support for index generation based on data stored in Sesame ontology. To express the attributes to extract from Sesame to the right structures we need to use a configuration file configtopk.xml (Figure 1).

Figure 1 Index preparation

In on-line phase we do the searching of top-k objects. The main search algorithm requires indexed attributes, local preferences (f1,…,fm), global preferences F (fuzzy rules or aggregation function) and number of retrieved objects k. As a result, the algorithm returns ordered list of objects with their overall value (Figure 2).

Figure 2 Searching of top-k objects

A.1.3      Scenarios of Use

Top-k aggregator can be used in the following scenarios:

§  Accessing indexed data (getting all attributes of specified object) – Typically used by tools UPreA and IGAP

§  Searching top-k objects based on local preferences and aggregation function usually acquired from UPreA

§  Searching top-k objects based on local preferences and fuzzy rules (typically learned by IGAP) usually acquired from UPreA

§  Sorting only a subset of objects in memory based on user preferences (both local and global)

Top-k aggregator should not be used in following cases:

§  Source data are not stored in Sesame (new index creator needed)

§  Domain size of any nominal attribute is very large i.e. almost every object have distinct value of a given attribute

§  Not monotone aggregation function is used. Many times it is possible to modify fuzzy sets and aggregation function to have monotone aggregation function with the same results. We do not have such a conversion method that can do it automatically.

A.1.4      External Links and Publications

§  Acquisition of user preferences and search of top-k objects: Gurský P., Horváth T., Novotný R., Vaneková V., Vojtáš P.: UPRE: User preference based search system. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06), pp. 841-844, 2006

§  Detailed description of Top-k aggregator tool: Gurský P., Šumák M.: Top-k Aggregator. Proceedings Tools for Acquisition, Organisation and Presenting of Information and Knowledge, ISBN 80-227-2468-8, pages 115-124, 2006

§  Analysis of search methods with the support of local preferences: Gurský P.: Towards better semantics in the multifeature querying. Proceedings of Dateso 2006, ISBN 80-248-1025-5, pages 63-73, 2006

§  Detailed analysis of the core of threshold algorithms family: Fagin R., Lotem A., Naor M.: Optimal Aggregation Algorithms for Middleware. In Proc. 20th ACM Symposium on Principles of Database Systems, pages 102-113, 2001

A.2          Integration Manual

This section describes integration of Top-k aggregator to other applications. Top-k aggregator is developed in Java SE 5 and its classes and methods can be imported from Java archive.

A.2.1      Dependencies

Top-k aggregator uses following Java archives which must be included in build path:

§  commons-configuration-1.3.jar - generic configuration interface which enables an application to read configuration data

§  de-olikurt-parser-(serializable-patch).jar – tool for evaluation of mathematical expressions

§  igap.jar – tool IGAP used in IgapTopKFacade to work with integration of Top-k aggregator and IGAP tools

§  log4j-1.2.12.jar – logging utility

§  mysql-connector-java-5.0.3-bin.jar – MySQL connector

§  openrdf-model.jar – classes to work with sesame query result

§  sesame.jar – provides access to Sesame

§  spring.jar – only JDBC template is used

§  sumak.jar – useful classes to work with metric indexes

§  upjs-core.jar – core classes widely used in IGAP, UPreA and Top-k aggregator

Top-k aggregator needs Sesame at the offline phase only. MySQL need to be alive at both off-line and on-line time.

A.2.2      Installation

Installation of Top-k aggregator requires the following steps:

1.    Sesame server must be installed, configured and filled with domain ontology.

2.    All jar archives listed above plus topk.jar archive containing Top-k aggregator must be included into project.

3.    SQL script topk.sql must be executed in a new database in MySQL

4.    Files configtopk.xml and topk.properties should be checked and adapted as described in configuration session of this manual.

5.    The index should be created by executing uinf.wid.util.IndexCreation class

6.    Class IgapTopKFacade requires installed and configured tool IGAP. See integration manual of IGAP.

A.2.3      Configuration

Several properties files are necessary for proper working with Top-k tool. First requirement is to specify a root folder of the project in wid-home.properties.

For the access to Sesame we need to check sesame.properties file which influences access to Sesame database by following values:

§  sesame.url – URL of running Sesame server.

§  sesame.login, sesame.password – user name and password for Sesame.

§  sesame.repository – abbreviated name of Sesame repository.

§  sesame.owl, sesame.c, sesame.r, sesame.gu, sesame.jou, sesame.inst, sesame.ofr, sesame.jo - namespace prefixes starting with “<” and ending with “#”. If ontology schema changes, these namespaces must be updated.

§  sesame.a – full URI of rdf:type predicate, enclosed in “< >”.

§  sesame.instancePrefix – specifies which namespace is used for user instances.

Main configuration file for Top-k aggregator is topk.properties with following values:

§  topk.dbDriver – driver for MySQL database.

§  topk.dbAddressTopK – JDBC address to the database where topk.sql was executed.

§  topk.dbLoginTopK, topk.dbPasswordTopK – user name and password for MySQL.

§  topk.prefixForFiles – directory, where index files generated by Top-k aggregator will be stored.

§  topk.configXmlFile – file with the configuration for IndexPreparator class that generates indices from domain ontology stored in Sesame.

§  topk.inputXmlFile – file that specifies the index files (for testing purposes only).

The last important configuration file is the file specified in topk.configXmlFile property. Here we can specify several types of attributes we want to index. It is an XML file having tags with the following semantics:

§  <attributes> – root tag holding <attribute> tags only.

§  <attribute> - Each attribute has properties name and description. Name is the string that specifies the name of the attributes used in IndexHolder and it is the core name for index files. Each attribute can be one of these types (specified by type property):

o  fuzzy – attribute with string values in ontology but with orderable domain e.g. Education Level has values eElementary, …, ehPhD with natural ordering. In the <assignment> tag we can assign the string values to the numbers to and express the ordering explicitly.

o  computable – simple ordinal attribute or the attribute with complicated ordering based on several sub-values. Each sub-value can be a number or fuzzy. E.g. MinSalary is a combination of amount, currency and period. This attribute has an extra property “value” that specifies the overall value of the attribute for each object by algebraic expression. In the expression we can use number of operators, constants and variables. Variables are specified in the specification of sub-values in tag <column>

o  crisp – attribute without ordering with string values

o  timestamp – attribute of type timestamp

All attributes contain tag <select> that contains a select with object id (URI) in the first column and attribute values in the second and possible other columns (for computable attribute)

§  <column> - This tag is used for fuzzy and computable attributes. If the column stores orderable string values, the <column> tag contains <assignment> tags to assign strings to numbers. If <column> is in the computable attribute, it must contain properties “num” (the id of column), “variable”(char that can be used as a variable in the algebraic expression in attribute property “value”), “type” (can be number or fuzzy) and “null” (the value of the variable, when there is null entry in ontology)

§  <assignment> - Tag to assign one string value to a number. String value is specified in property “level” and number in property “evaluation”.

 

A.2.4      User Guide

Top-k aggregator provides two main functionalities:

§  Generating and providing the access to index structures.

§  Searching top-k objects based on local and global preferences.

Work with index

To generate new index we need to have domain ontology stored in Sesame, types of attributes specified in configuration XML file and database in MySQL with tables generated by topk.sql file.

Index holds information about original object id (URI in ontology), Instance of class ListOfAttributes serialized to the file and one of index structures for each attribute.

General class to access indices is IndexHolder. It allows create index, access to the default ListOfAttributes serialized to the file at the time of index creation, offers conversion from internal id (integer) and external id (URI) and vice versa, and finally it offers attribute values of object if internal/external id of object is known.

Whole process of creation of new indices can be done by execution of procedure:

uinf.wid.tools.topk.sources.IndexHolder.createIndex();

Such functionality has the executable uinf.wid.util.IndexCreation class.

Object ListOfAttributes stores both local and global user preferences. Every attribute in this list is either fuzzy (attribute values have some natural ordering) or crisp (no natural ordering). Fuzzy attribute contains FuzzySet object, crisp attribute contains a set of possible values. Every attribute has weight that can be used in weighted average as global preference. Using the following method we can get default ListOfAttributes object with fuzzy sets specified in XML configuration file.

ListOfAttributes listOfAttributes = IndexHolder.getListOfAttributes();

Conversion between external String id and internal integer id is captured by functions getStrId and getIntId:

String uri = IndexHolder.getStrId(intId);

int intId = IndexHolder.getIntId(uri);

If we need to work with attribute values of object we can use methods getValueFromTree for ordinal (fuzzy) attributes, getCrispValue for crisp values and getTimestampValue for timestamp attributes. If there was more than one value of attribute for the object, the functions return first of them in a select from XML configuration file. Each of the functions has 2 arguments: internal id of object and name of attribute defined in <attribute> tag of XML configuration file. Examples follow:

String eL = (String) IndexHolder.getValueFromTree(id, "EducationLevel"));

// returns e.g.“eSecondary“

Double salary = (Double) IndexHolder.getValueFromTree(id, "MinSalary"));

String place = IndexHolder.getCrispValue(id,"Place");

Timestamp ts = IndexHolder.getTimestampValue(idOffer, "AcquisitionDate");

Searching top-k objects

Search of top-k objects can be executed using TopKFacade class. Before calling of top-k search we can change two default settings.

The first default setting says that hard restrictions on attributes specified in fuzzy functions of fuzzyfications is enabled. Hard restriction is specified by zero fuzzy value of the domain value. If the hard restriction is not accomplished, the object has the overall value equal to zero i.e. irrelevant to be between top-k objects.

The second setting, disabled by default, says that algorithm continues the search in attributes where the minimal fuzzy value was reached. This setting should be enabled only if there is less than k objects in the result but more objects in the index. This setting can help to find new objects but with already known value.

TopKFacade topKFacade = new TopKFacade();

topKFacade.disableHardRescrictions();

topKFacade.setForceGoBehindDefault();

Top-k search can be evaluated with two different types of global preferences. If we want to use the weighted mean we should use:

List<RatedInstance> topk=topKFacade.getTopKObjekts(listOfAttributes, k);

If global preferences have a form of fuzzy rules we need to use:

List<RatedInstance> topk = topKFacade.getTopKObjekts(listOfAttributes, hypothesesTopK, k);

Both methods return a list of RatedInstance-s. RatedInstance is the interface with two methods getURI and getRating. The first method returns external String id (URI in ontology) and the second returns double value that is the overall value of the object. List is ordered by rating in descent order (better objects are the first).

If you want to use rules generated by tool IGAP, it is better to use class IgapTopKFacade that covers both tools IGAP and Top-k aggregator. Before calling top-k search we can change default settings as well as in TopKFacade.

IgapTopKFacade igapTopKFacade = new IgapTopKFacade(attributeDao);

igapTopKFacade.disableHardRescrictions();

igapTopKFacade.setForceGoBehindDefault();

List<RatedInstance> topk = igapTopKFacade.getTopKObjekts(ratedOffers, k);

Argument attributeDao is an instance of interface AttributeDao defined in Igap archive. It contains only one getter for ListOfAttribute object. Argument ratedOffers is a list of RatedInstance-s holding object URIs with user rating.

If we want to work with user preferences more intensively, the suitable facade is UpreaFacade provided by tool UPreA. See documentation of UPreA tool.

A.3          Developer Manual

A.3.1      Tool Structure

Top-k aggregator consists of the following packages (see Figure 4):

§  Common classes used also by IGAP and UPreA tools (uinf.wid.core) included in upjs-core.jar

§  Basic classes to cooperate with top-k aggregator (uinf.wid.tools.topk)

§  Alternative algorithms of top-k search (uinf.wid.tools.topk.algorithms);

§  Several types of aggregation functions (uinf.wid.tools.topk.evaluators);

§  Heuristics used by search algorithms (uinf.wid.tools.topk.heuristics);

§  Index structures for different types of attributes and index access classes (uinf.wid.tools.topk.sources);

Figure 4 Packages and dependencies

A.3.2      Method Implementation

Index creation

In the current implementation we have only one automatic index creator in class IndexPreparator. It can be used to fill index structures from data stored in Sesame. The configuration can be changed in XML file written in topk.configXmlFile property. The process of index creation follows the SAX parsing of the XML file. In XML we have one subtree for each attribute to be indexed. Based on attribute type, IndexPreparator chooses appropriate index structure (B+tree, Dis-index, MySQL table or M-tree). For each attribute IndexPreparator runs the Sesame select with result containing URIs of domain objects and attribute values. Then it fills index structure and stores the default local preferences of the attribute from XML file. After processing all the attributes from XML file, IndexPreparator serializes the ListOfAttributes object to a file.

Top-k search

For top-k search we can use any algorithm from uinf.wid.tools.topk.algorithms package. We prefer the use of 3P-NRA (3-phased no random access) implemented in AlgSortedAccessesNRA3 class.

The 3P-NRA algorithm works as follows.

Input: k, F, m sources ordered by f1,…,fm

Output: k ranked objects if exist

 

T=Æ, C=Æ

Phase1:

Do the sorted access in parallel to all sources to get records

<x1f1(x1)>…<xm, fm(xm)>.

For each xÎ compute W(x) and B(x).

If |T|<k then put x to T

Else If W(x) ≥ W(Tk) then

     put x to the right place in T; If x was in C, move Tk to C

If W(Tk) is ≥ threshold F(u1,…,um) goto Phase2

Else goto Phase1

Phase2:

For each object x Î C compute B(x);

If x is no more relevant (i.e. B(x)≤W(Tk)) remove it from C.

If |C|=0 return T and exit; otherwise goto Phase 3.

Phase3:

Do the sorted access in parallel to the sources for which there are

unknown values for objects in C and T.

For every object x seen under sorted access do

  If x Ï TÈC ignore it

  otherwise compute W(x) and B(x);

    If B(x)≤W(Tk) remove x from C

    If |C|=0 return T and exit

    If W(x)≥ W(Tk) then

      put x to the right place in T; If x was in C, move Tk to C

If W(Tk) increased or threshold F(u1,…,um) decreased during last

    Phase3 then a heuristic H can choose to go to Phase 2

  Otherwise repeat Phase 3

Let us describe the labels. Value m represents the number of attributes of objects or the number of sources. Given an object x=(x1,…,xm), and subset V(x)={i1, … , in} Î {1, ..., m} of known attributes xi1, …, xin of x, define WV(x) (or shortly W(x) if V is known from context) to be minimal (worst) possible value of the aggregation function F for the object x. Because we assume that F is monotone aggregation function, we can compute its value by substituting for each missing attribute the value 0. For example if V(x)={1, …, g} then WV(x)=F(f1(x1),…,fg(xg), 0, …,0), where f1,…,fm are user specified fuzzy predicates.

Analogously we can define maximal (best) possible value of the aggregation function F for object x as BV(x) (or shortly B(x) if V is known from context). Since we know that fuzzy values in the sorted lists are ordered in descending way, we can substitute for each missing value the corresponding value from the vector u = (u1,…,um), where u1,…,um are the last fuzzy values seen from each source. For example if V(x)={1, …, g} then BV(x)=F(f1(x1),…,fg(xg), ug+1,… ,um).

The real value of object x is W(x) ≤ F(f1(x1),…,fm(xm)) ≤ B(x). Note that unseen objects during the computation (no values are known) has B(x) = F(u1, …,um) and W(x) = F(0,…,0). On the other hand if we know all the values W(x) = B(x) = F(f1(x1),…,fm(xm)). The value F(u1, …,um) is well known as the threshold value.

Next we use the list T of objects ordered by worst value. The object in T with the smallest worst value is labeled as Tk. In the (unordered) set C we store the candidates with the worst value smaller or equal to W(Tk) but with the best value larger than W(Tk). These are the objects with a hope to get into T later. Objects in C we call the candidates.

Simulating the sorted access over index structure

The simulation of sorted access, using the local preference function, depends on attribute type. The most common attributes are ordinal ones. Typical structures for indexing the ordinal attributes are B-trees and B+ trees. In our implementation we have slightly modified the B+ tree having all data in the tree and leaves stored on the disk. Pages of leaf data are double linked and we can easily traverse them in either direction. The leaf traversal supported by B+ tree allows us to follow records (attribute values) from higher ranking to lower ranking. When a user defines a scoring function over the ordinal attribute, he/she actually defines the ordering of the domain.

We assume that we can find local maximums of the scoring function. Having partially linear scoring function, the search of local maximums is trivial. In the case of arbitrary scoring function in analytical form, we need to do some kind of additional analysis. Local maximums are the starting points of traversing the B+ tree. After delivering the local maximums, the whole scoring function is used as a black box.

Simulation works as follows.

Input:  -Scoring function f (as a black box),

        -Set X := {x: f(x) is a representative of each local maximum of f}

Output: -The next best object with its fuzzy value in the sorted output stream.

 

Function getNext()

If (first call) then

  For all xÎX do

      Traverse from the root of B+ tree to find neighbor records

        <a,xa>, <b,xb> in a leaf (or leafs) such that xaxxb

      Add triples [p(a), f(xa), left] and [p(b), f(xb), right] to T (if

          there were no such records on the left or on the right do not

          add the corresponding triple)

  [p(o), f(xo), directiono] := max(T).

  Return <o,f(xo)>.

Else

  If T=Æ return “no more objects”

  [p(o), f(xo), directiono] := max(T).(last returned object)

  Remove [p(o), f(xo), directiono] from T

  Traverse to the record <u,xu> defined by p(o) and directiono.

    (This can be done in the memory if the record was already read

    from the disk or we need to use the leaf traversal of B+ tree

    first.)

    If there is such a record

      If [p(u), f(xu), opposite(directiono)]ÎT (in local minimum)

        Replace it by [p(u), f(xu), null] in T

      Else add [p(u), f(xu), directiono] to the set T

    Else return getNext();

  [p(v), f(xv), directionv] := max(T).`

  If directionv=null remove [p(v), f(xv), directionv] from T

  Return <v,f(xv)>.

Algorithm simulating the sorted access over B+ tree is formally described in figure 3. Any triple [p(o), f(xo), directiono] in T represents the following: p(o) is the position of object o in a materialized leaf in the memory, f(xo) is the relevance of object o given by scoring function f and third value represents the direction of the next move after returning of the object to the master algorithm.

Function max(T) returns triple with highest score. If there are more triples with same score max(T) returns one of them with the lowest object id.

Figure 3 Simulation of sorted access over B+tree

Example: To illustrate the simulation of sorted access consider the situation on Figure 3. The fuzzy function has one local maximum: the interval <200,250>. When top-k algorithm calls for the first object by sorted access, we need to search the random value in the interval (e.g. 225) by hierarchical traversal. We find 2 neighbor objects around 225 i.e. add triples [10,1,right] and [1,1,left] to T. Seeing that the both fuzzy values are the same we return object with lower object id i.e. <1,1>. In further sorted access calls we will follow one of the gray arrows to get new objects to return. Next we traverse left for the object 3 with fuzzy value 0.9 but return better object 10 with fuzzy value 1.0. After the next sorted access call we follow the direction of the last returned object (i.e. to the right) and compute fuzzy value 1.0 of object 7. After the next sorted access call and computing fuzzy value 0.6 of object 4 we need to sent object 3 and traverse to the left in the next call.

Similar simulations works also over other structures (Dis-index for nominal attributes and M-tree for metric attributes)

Top-k aggregator is designed for several alternatives of top-k search. We can use different:

§  Attribute types with appropriate sources (indexes)

§  Global preferences (evaluators)

§  Local preferences (fuzzy sets, fuzzyfications, enumeration of values)

§  Algorithms

§  Heuristics in algorithms

Several alternatives of evaluators, heuristics, sources and algorithms can be seen on Figure 5. Default alternatives of algorithm, heuristic and evaluator can be changed in the uinf.wid.tools.topk.TopKBuilder class. Source is determined by index type.

Figure 5 Hierarchy of several classes

Figure 6 Simplified version of Top-k search

The process of searching top-k objects covers many object calls. Simplified communication between main objects is described on Figure 6. Scenarios are different for each combination of algorithm and heuristic.

A.3.3      Enhancements and Optimizing

The design of Top-k aggregator allows easy integration of new and better source (index) structures, algorithms, heuristics, local and global preferences. Nowadays there is need to cover hierarchical attributes. Complex analysis of a domain data can lead to suitable combination of all variable components.

A.4          Manual for Adaptation to Other Domains

Top-k aggregator tool can be easily adapted to other domain when data are stored in ontology and we can extract the attributes of objects using SeRQL language. Data stored in other structures need new index creator. We have tested domains of houses, hotels and job-offers so far.

A.4.1      Configuring to Other Domain

Main change should be done in configuration XML file. There we must define the attributes of objects in a new domain. Configuration XML file was described in section A.2.3. The properties defining the access to the Sesame and MySQL should be checked as well.  

A.4.2      Dependencies

All dependencies are domain independent.