A.1 Basic Information
Full-text indexing and querying is a typical problem solved in the area of Information Retrieval. Usually, the information is stored and represented in the form of documents; a user's query is evaluated and the result is returned as a set of relevant documents. There are several possibilities how to represent the documents and the user's query and how to calculate document relevance. The two most widely used theoretical models are the Boolean model and the vector model (the latter in many variations). There are also various approaches to storing the document index. For example, the Apache Lucene project stores its data in self-maintained files in the file system. However, relational database systems (nowadays the single most used data storage method) have many advantages, and therefore we consider them as the primary storage for our document index.
The tool works with two languages, English and Slovak. For English it uses Porter's algorithm to group all forms of the same word; for Slovak it uses the method of the partner NAZOU tool Morphonary.
A.1.1 Basic Terms
§ Document – a classical, smooth text without additional structure.
§ Term – a word, more exactly its basic form (lemma).
§ Inverted list – a simple array of terms which can appear in the collection of documents. Moreover, each term in this list is associated with an extra list of documents in which the given term appears.
A.1.2 Method Description
In order to have efficient and fast access to the document index, the concept of an inverted list is often used. This approach has the lowest time and space requirements among the possible alternatives. We can describe the inverted list as a simple array of terms which can appear in the collection of documents; each term in this list is associated with an extra list of documents in which the given term appears (see Figure 1). Only basic morphemes are stored in the inverted list. This decreases the semantics only a little, but shortens the inverted list significantly.
Fig. 1. Overview of an inverted list.
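The structure in Figure 1 can be sketched as a simple mapping from terms to document lists (a minimal, self-contained illustration, not the tool's actual classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedListSketch {
    // term -> list of IDs of documents in which the term appears
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // record that a term occurs in the given document
    public void add(String term, int docId) {
        List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
        // documents are added in order, so only the last entry needs checking
        if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
            docs.add(docId);
        }
    }

    // documents containing the term (empty list if the term is unknown)
    public List<Integer> documentsFor(String term) {
        return postings.getOrDefault(term, List.of());
    }
}
```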
Now we can propose the data model for our representation of the index in the relational database. There are three fundamental tables, see Figure 2. The table Doc represents the concept of a document. It contains a document identifier, a name, the actual document data (usually as a CLOB) and the norm of the vector which represents this document. The table Term accommodates the term data: inverse document frequency, document frequency and the total count of occurrences of the particular term in the whole collection. The third table, TermDoc, connects the two aforementioned tables. It holds, for each term id_term occurring in a document id_doc, the number of occurrences and the weight of the term in that document. The process of indexation then fills the inverted list and therefore the database tables.
Fig. 2. Class Diagram representing data model.
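The data model of Figure 2 can be sketched as a DDL fragment. Only the columns described above are certain; the concrete names and types below are assumptions, and the authoritative DDL is shipped as the jdbsearch.sql script:

```sql
-- Sketch of the three fundamental tables (column names and types are
-- assumptions based on the description above)
CREATE TABLE Doc (
  id      INT PRIMARY KEY,
  name    VARCHAR(255),
  data    LONGTEXT,   -- the actual document data (CLOB)
  norm    DOUBLE      -- norm of the vector representing this document
);
CREATE TABLE Term (
  id      INT PRIMARY KEY,
  term    VARCHAR(100),
  idf     DOUBLE,     -- inverse document frequency
  df      INT,        -- document frequency
  cnt     INT         -- total occurrences in the whole collection
);
CREATE TABLE TermDoc (
  id_term INT,        -- references Term(id)
  id_doc  INT,        -- references Doc(id)
  cnt     INT,        -- occurrences of the term in the document
  weight  DOUBLE      -- weight of the term in the document
);
```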
A user's query is typically a Boolean query of one of two types:
§ an Any-query has its terms delimited by ORs,
§ an All-query uses ANDs.
The relational database used as a backend for the document index is accessed by SQL queries, and we can fairly easily represent the user's queries in SQL.
An Any-query can be mapped to a classical (i. e. SELECT-FROM-JOIN-WHERE) SQL query. An All-query is, however, slightly more complicated, because a classical query would be very slow (the number of joins would equal the number of terms in the Boolean condition). Instead, we can use a query containing a nested query in the WHERE clause.
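These two mappings can be illustrated by constructing the SQL strings directly. The sketch below assumes the TermDoc table with the id_term and id_doc columns introduced with the data model, and takes already-resolved term identifiers; it is an illustration only, not the tool's actual query generator:

```java
import java.util.List;

public class BooleanQueryBuilder {
    // Any-query: a single SELECT with the term conditions delimited by ORs
    public static String anyQuery(List<Integer> termIds) {
        StringBuilder where = new StringBuilder();
        for (int i = 0; i < termIds.size(); i++) {
            if (i > 0) where.append(" OR ");
            where.append("id_term = ").append(termIds.get(i));
        }
        return "SELECT DISTINCT id_doc FROM TermDoc WHERE " + where;
    }

    // All-query: one nested query in the WHERE clause per additional term,
    // avoiding a join per term
    public static String allQuery(List<Integer> termIds) {
        String sql = "SELECT DISTINCT id_doc FROM TermDoc WHERE id_term = " + termIds.get(0);
        if (termIds.size() > 1) {
            sql += " AND id_doc IN (" + allQuery(termIds.subList(1, termIds.size())) + ")";
        }
        return sql;
    }
}
```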
The use of the vector model has one huge advantage over the classical Boolean model: it allows us to calculate the relevance of the found documents. This can also be computed using standard SQL resources, i. e. by a query using some additional tables: Query contains the terms of the query and their corresponding weights, and DocQuery holds the documents which satisfy the given Boolean condition, together with their norms.
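The relevance in the vector model is typically the cosine measure; the SQL computation described above corresponds to the following calculation (a toy sketch with hypothetical weight maps, not the tool's actual query):

```java
import java.util.Map;

public class RelevanceSketch {
    // cosine relevance: dot product of the document and query weight vectors,
    // divided by the product of their norms
    public static double relevance(Map<String, Double> doc, Map<String, Double> query,
                                   double docNorm, double queryNorm) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            dot += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return dot / (docNorm * queryNorm);
    }
}
```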
A.1.3 Scenarios of Use
JDBSearch has two functionalities which imply two different scenarios of usage:
§ Document index storage
§ Full-text querying (ANDs and ORs are supported).
JDBSearch supports two languages – English and Slovak.
JDBSearch can, but should not, be used for structured documents. Because it ignores the position of a word in the sentence, it cannot be used for word-order dependent research (e.g. sentence analysis).
§ Lencses, R.: Indexing for Information Retrieval System Supported with Relational Database. SOFSEM 2005, Slovakia, January 2005. In: Vojtáš, P. et al. (eds.): SOFSEM 2005 Communications, Bratislava, 2004, pp. 81-90.
§ Lencses, R.: Dopytovanie v systéme zameranom na získavanie informácií s podporou relačnej databázy (Querying in an information retrieval system supported by a relational database). In: Proceedings of DATAKON 2004, Masaryk University, Brno, Czech Republic, 2004, pp. 271-280.
§ Lencses, R.: Fulltext Indexing and Querying with a Support of Relational Database. In: Návrat, P., Bartoš, P., Bieliková, M., Hluchý, L., Vojtáš, P. (eds.): Tools for Acquisition, Organization and Presenting of Information and Knowledge, Proceedings in Informatics and Information Technologies, Vydavateľstvo STU, 2006, ISBN 80-227-2468-8, pp. 105-110.
A.2 Integration Manual
This section describes the integration of JDBSearch into other applications. The JDBSearch tool is developed in Java SE 5 and its classes and methods can be imported from the Java archive.
JDBSearch uses the following Java archives, which must be included in the build path and the runtime path:
§ nazou-uinf-core.jar – contains the classes which provide the configuration facility
§ Nazou-ITG-2.1.jar – integration technology; only class NazouConfiguration is used
§ log4j-1.2.12.jar – logging utility
§ commons-logging-api.jar – a logging abstraction library
§ spring.jar – Spring framework; its JDBC support layer is used
§ sesame.jar – provides access to Sesame
§ openrdf-model.jar – dependency library for Sesame
§ openrdf-util.jar – dependency library for Sesame
§ rio.jar – dependency library for Sesame
§ mysql-connector-java-5.0.3-bin.jar – MySQL connector
The tool works with the user ontology. It requires a Sesame server up and running (not necessarily on the same computer) and filled with RDF data from the user ontology.
A single property file is necessary for the proper working of the JDBSearch tool. The tool-jdbsearch.properties file should be placed into the META-INF directory on the CLASSPATH. This file is actually a Spring application context description file that follows the Spring Properties Beans Definition format. It defines three Spring beans and configures them in the usual manner.
The important configuration settings are:
§ jdbsearchDataSource.user – specifies a user for the connection to the database
§ jdbsearchDataSource.password – specifies a password for the connection to the database
§ jdbsearchDataSource.serverName – specifies a hostname for the connection to the database
§ jdbsearchDataSource.databaseName – specifies a database name which will be used for the connection
§ jdbsearchIndex.minFreeMemory – specifies the minimal free memory limit allowed for a single indexing batch.
§ jdbsearchIndex.maxSize – specifies the maximum total size, in bytes, of the documents in a single indexing batch.
§ jdbsearchIndex.documentTermsFile – specifies a writable temporary file used for the batch load of term occurrences in documents.
§ jdbsearchIndex.termsFile – specifies a writable temporary file used for the batch load of terms and their weights.
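Put together, a minimal tool-jdbsearch.properties might look as follows. All values are placeholder examples, and the real file may contain further bean definition lines in the Spring Properties Beans Definition format:

```properties
# Example configuration sketch; values are placeholders only
jdbsearchDataSource.user=nazou
jdbsearchDataSource.password=secret
jdbsearchDataSource.serverName=localhost
jdbsearchDataSource.databaseName=jdbsearch

jdbsearchIndex.minFreeMemory=10000000
jdbsearchIndex.maxSize=50000000
jdbsearchIndex.documentTermsFile=/tmp/jdbsearch-termdoc.txt
jdbsearchIndex.termsFile=/tmp/jdbsearch-terms.txt
```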
Furthermore, JDBSearch requires the creation of the SQL schema in the MySQL database. The schema can be created by using the provided DDL script jdbsearch.sql.
Installation of JDBSearch requires the following steps:
1. All JAR archives listed above and the JDBSearch JAR archive must be included in the project.
2. Configuration files must be present in the META-INF directory in the CLASSPATH.
3. A database schema must be prepared in the MySQL database and all necessary tables must be created.
A.2.4 User Guide
The general usage of JDBSearch consists of two distinct parts: the indexing process and the querying. Both can be achieved in various ways, by running the tools from the command line or by using programmatic access.
The most usual use case for this tool is indexing a single filesystem directory containing multiple files which correspond to the documents. This can easily be done by using the command-line utility.
The class Indexer can be executed from the command line in the usual Java manner. It takes a single parameter: a directory which will be traversed for documents and reindexed.
For example, running the Indexer with the parameter /home/jdbsearch/data reindexes the files in the /home/jdbsearch/data directory. All documents are treated as plaintext documents and are indexed into the database configured in the tool-jdbsearch.properties file.
The querying process can be executed from the command line, too; this task is handled by the Query class. For example, querying for the terms java php sql with the AND conjunction as a separator returns the documents which contain all three terms. The alternate allowed separator is OR.
An alternative to the command-line indexing and querying is the direct use of the provided classes. There are two front-end classes which can be used for the programmatic approach.
The indexation process can be achieved by instantiating the DocumentIndexStorage class and configuring it properly. The setters which need to be called correspond to the properties of the jdbsearchIndex bean listed in the tool-jdbsearch.properties section.
Moreover, this class requires a configured javax.sql.DataSource implementation, which provides the access to the MySQL database. This implementation can be set on the DocumentIndexStorage via the appropriate setter.
This class supports the various sources for the document data. Most often, the tool should index the plaintext files in the directory. This is represented by the FileSystemDocumentProvider class, which requires the document directory in the constructor. An instance of this object can be set on the DocumentIndexStorage via setter.
Finally, the method writeIndex() can be used to run the actual indexing process.
The programmatic querying is simpler than indexing. It can be achieved by instantiating the FulltextQuery class and providing a suitable javax.sql.DataSource implementation.
The actual programmatic querying can be achieved by calling the query methods of this class. Two of them perform searches in which the terms are separated by the AND or OR conjunction; both take a query String as their parameter.
All methods return a list of Result objects, which contain the internal document ID and the metadata assigned to the particular document in the indexation process. Besides that, each Result object contains a weight representing the relevance of the result.
The third method, the similarity search, seeks documents similar to the document specified by its internal ID. (This internal ID can be retrieved by calling the getDocumentIdsByMetadata() method on the FulltextQuery instance; it returns the internal IDs of all documents which contain the specified metadata.)
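The shape of the returned Result objects, and a typical ordering by relevance, can be pictured as follows (the field and method names in this sketch are illustrative assumptions, not the tool's actual API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a result object as described above: internal document ID,
// its metadata and a relevance weight (names are assumptions)
class ResultSketch {
    final int documentId;
    final String metadata;
    final double weight;

    ResultSketch(int documentId, String metadata, double weight) {
        this.documentId = documentId;
        this.metadata = metadata;
        this.weight = weight;
    }

    // order results by descending relevance before presenting them
    static List<ResultSketch> byRelevance(List<ResultSketch> results) {
        List<ResultSketch> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((ResultSketch r) -> r.weight).reversed());
        return sorted;
    }
}
```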
The configuration of the indexing and querying parts can make use of the fact that both DocumentIndexStorage and FulltextQuery are configured as Spring beans in the tool-jdbsearch.properties file.
One can instantiate an application context and configure it via the given property files. However, this task is already implemented in the WidApplicationContext class, which automatically discovers the configuration files and instantiates all configured beans, radically simplifying the programmatic access.
A.3.1 Tool Structure
The JDBSearch tool implementation is divided into two core packages and one integration package, according to the natural indexing-querying structure.
§ jdbsearch.index – contains the necessary classes for the indexation process and all core classes (documents, document providers etc.)
§ jdbsearch.querying – contains the classes providing the querying services and necessary auxiliary classes like Result objects
§ jdbsearch.integration – provides the utility classes which are useful in the integration process (facades for indexation, additional document providers etc.)
A.3.2 Method Implementation
The implementation of the document indexing process is mainly incorporated in the DocumentIndexStorage class.
The fundamental notion in the indexing process is a Document. Basically, it is a sequence of characters which is uniquely identified by an ID and can have multiple metadata entries. Documents (and their data) can appear in various forms. A typical example is a document based on plaintext data read from the filesystem; this is represented by the class FileSystemDocument. However, there are other possibilities, e.g. documents which correspond to values in database rows or documents which wrap ontology data.
The document index storage then retrieves a list of documents and processes their textual data.
This list can be retrieved from various sources, which are represented by document providers (class DocumentProvider). There are several document providers:
§ FileSystemDocumentProvider reads the text files from a filesystem directory. Each file is wrapped as a FileSystemDocument and its text data are retrieved lazily from the file.
§ MySqlRowDocumentProvider reads the document data from the table rows which result from a particular SQL select.
The actual indexing process (invoked via writeIndex() method) then consists of the following phases:
§ retrieve the list of documents and traverse it, saving each document into the database table Document. The actual implementation can be found in the private writeDocumentsToTable() method.
§ create an empty inverted list which will contain the references between terms and documents (see the class InvertedList). The inverted list is implemented as a mapping from term strings to extended terms (class HugeTerm), where each extended term contains the list of documents in which the term string occurs, together with the numbers of occurrences.
§ retrieve the text data from each document, tokenize it, process it via a suitable Lemmatizer implementation and contribute each term to the inverted list (see generateInvertedList() method).
§ when all documents have been processed, traverse the inverted list and generate two auxiliary text files which will contain the data for the batch load into the database.
o the file specified in the variable documentTermsFile contains the data for the table which stores the occurrences of the terms in the documents.
o the file specified in the variable termsFile contains the data for the table, which contains the terms and their weights.
Both files are created and written to in the methods writeTermDocToFile() and writeTermsToFile().
§ when the auxiliary files are created, load the data into the database by using the appropriate batch load facility. This is done in methods callLoaderTerms() (for terms file) and callLoader() (for the occurrences).
§ recalculate term occurrence weights in the tables and calculate the tf and idf weights (see method writeWeightsTermToTable()).
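In miniature, the tokenization and inverted-list phases above amount to the following self-contained sketch (a toy stand-in with lower-casing in place of the real Lemmatizer; the tool's actual classes differ):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniIndexingPipeline {
    // term -> (document ID -> number of occurrences), a tiny inverted list
    public static Map<String, Map<Integer, Integer>> index(List<String> docs) {
        Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            // tokenize on non-word characters
            for (String token : docs.get(id).split("\\W+")) {
                if (token.isEmpty()) continue;
                String term = token.toLowerCase();   // stand-in for the Lemmatizer step
                inverted.computeIfAbsent(term, t -> new TreeMap<>())
                        .merge(id, 1, Integer::sum); // contribute the occurrence
            }
        }
        return inverted;
    }

    // idf weight of a term: log(collection size / document frequency)
    public static double idf(Map<String, Map<Integer, Integer>> inverted,
                             String term, int collectionSize) {
        return Math.log((double) collectionSize / inverted.get(term).size());
    }
}
```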
Document querying is implemented in the FulltextQuery class, which provides the querying methods mentioned in the previous sections. The querying process consists of multiple phases:
§ take the input query string and tokenize it into a list of terms (see the constructStrings() method)
§ take a list of terms and perform the stemming and lemmatizing process. This uses an associated implementation of Lemmatizer interface.
§ construct a SQL query that corresponds to the given query and query type. The All-query is constructed via constructQueryDocTermAnd() and the Any-Query is composed in the constructQueryDocTermOr().
§ execute the constructed query and map each returned row of the result set to a Result object, which contains the document ID, metadata and relevance (methods queryDbOr() and queryDbAnd())
A.3.3 Enhancements and Optimizing
There are various possibilities for enhancing the indexation process. For example, the actual insertion of documents into the database could be done in batches. Moreover, generalization and customization for different database engines can be achieved by adapting the generated SQL queries in both the indexing and querying classes.
Besides that, the actual indexing process can be easily parallelized: each individual indexing node can index one particular batch of documents, and one of the nodes can then be chosen to perform the document and term weight recalculation.
A.4 Manual for Adaptation to Other Domains
The methods and algorithms of the JDBSearch tool are generally domain independent. The only important part which may need to be changed is the dependency on the natural language of the documents: choosing another language requires a different process for term stemming and lemmatizing.
A.4.1 Configuring to Other Domain
The actual stemming is performed via the Lemmatizer interface. The default implementation is based on the traditional Porter stemmer. However, it is possible to provide another implementation, for example a Slovak stemmer based on the Morphonary tool.
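For illustration, a replacement implementation might look like the sketch below. The single-method shape of the Lemmatizer interface is an assumption made here so that the example is self-contained; the real interface shipped with JDBSearch may differ:

```java
// Stub of the Lemmatizer interface (assumption: it reduces a single
// word to its basic form; check the actual interface in the tool)
interface Lemmatizer {
    String lemma(String word);
}

// A trivial suffix-stripping implementation, standing in for e.g. a
// Slovak stemmer based on Morphonary
class SuffixStrippingLemmatizer implements Lemmatizer {
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    public String lemma(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // strip the suffix only when a reasonable stem remains
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```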
The lemmatizer implementation can be set on the DocumentIndexStorage object via the setter method.
Besides this, no further customization for a different application domain is necessary. If another implementation of the Lemmatizer interface is provided, it should be placed on the classpath in the traditional way.