JDbSearch

Fulltext indexing and querying over a document collection with the support of a relational database.

Institution: University of P. J. Šafárik
Technologies used: Java, JDBC, SQL
Inputs: Document collection, user query
Outputs: Relevant ranked documents
Documentation: HTML, doc, JavaDoc
Distribution packages: zip

Addressed Problems

This tool searches for documents relevant to a user query. It covers two areas: fulltext indexing and fulltext querying. Indexing stores arbitrary plain-text data (without any semantic tags) in a relational database, which speeds up subsequent queries. Querying retrieves the documents that are relevant to the user query, with the aim of high precision and recall.

Description

We provide a Java-based tool for indexing and querying based on the vector space model, which is more precise than the commonly used Boolean model. We build an auxiliary structure, an inverted list, which records the relations between terms and documents. This inverted list can be mapped onto tables in the relational database, which then serves as the fulltext index.
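
As an illustration of the inverted list, the following minimal sketch keeps the term-document relations in memory; each map entry stands in for one row of a table. The class name `InvertedListSketch`, the row layout (term, document id, term frequency) and the tf-idf weighting are assumptions made for this example, not the tool's actual schema or weighting scheme.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for an inverted list (hypothetical, for illustration only). */
final class InvertedListSketch {
    // term -> (document id -> term frequency); one inner entry per (term, doc_id, tf) row
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    private int documentCount = 0;

    /** Indexes one plain-text document under the given id. */
    void index(int docId, String text) {
        documentCount++;
        // Split on anything that is not a letter or digit, so accented letters survive.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Number of indexed documents that contain the term. */
    int documentFrequency(String term) {
        Map<Integer, Integer> row = postings.get(term);
        return row == null ? 0 : row.size();
    }

    /** Assumed tf-idf weight: tf * ln(N / df); 0 if the term does not occur in the document. */
    double weight(String term, int docId) {
        Map<Integer, Integer> row = postings.get(term);
        if (row == null || !row.containsKey(docId)) {
            return 0.0;
        }
        double idf = Math.log((double) documentCount / row.size());
        return row.get(docId) * idf;
    }
}
```

In a relational setting, the inner entries would become rows of a postings table and the weights could be computed or stored per row, so that ranking reduces to an aggregate query.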

The text data can come from an arbitrary source, which is represented by an implementation of a document provider. Several document providers are available out of the box, from basic ones that index file system documents to more complex ones that index ontology instances.
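
The provider idea can be sketched as follows. The names `DocumentProvider`, `TextDocument` and `FileSystemProvider` are hypothetical and chosen for this example; the tool's real interfaces may look different. The sketch assumes the provider pushes documents to the indexer one at a time, so only one document's text needs to be in memory at once.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ProviderSketch {
    /** Hypothetical value type: one plain-text document handed to the indexer. */
    static final class TextDocument {
        final String id;
        final String text;

        TextDocument(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Hypothetical contract: push documents from any source to the indexer. */
    interface DocumentProvider {
        void provide(Consumer<TextDocument> indexer);
    }

    /** Provider that reads every regular *.txt file below a root directory. */
    static final class FileSystemProvider implements DocumentProvider {
        private final Path root;

        FileSystemProvider(Path root) {
            this.root = root;
        }

        @Override
        public void provide(Consumer<TextDocument> indexer) {
            try (Stream<Path> walk = Files.walk(root)) {
                // Collect only the paths; file contents are read one by one below.
                List<Path> files = walk
                        .filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
                for (Path file : files) {
                    String id = root.relativize(file).toString();
                    indexer.accept(new TextDocument(id, Files.readString(file)));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

A provider for another source, such as ontology instances, would implement the same interface and produce the text of each instance instead of reading files.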

The user can then query the fulltext index in two modes, in which the query terms are joined either by the AND connective ("all" query) or by the OR connective ("any" query). These queries are translated into SQL commands that execute quickly and return a rank-ordered list of documents.
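
One possible way to translate the two query modes into SQL is sketched below. It assumes a hypothetical table `posting(term, doc_id, weight)` and ranks documents by the sum of the matching term weights; the table, its columns and the ranking formula are illustrative assumptions, not the SQL the tool actually generates.

```java
import java.util.Collections;
import java.util.Set;

final class QuerySqlSketch {
    enum Mode { ALL, ANY }

    /**
     * Builds a parameterized ranking query with one "?" placeholder per distinct term.
     * ANY keeps every document matching at least one term;
     * ALL additionally requires that every term matches.
     */
    static String toSql(Set<String> terms, Mode mode) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("query has no terms");
        }
        String placeholders = String.join(", ", Collections.nCopies(terms.size(), "?"));
        StringBuilder sql = new StringBuilder()
                .append("SELECT doc_id, SUM(weight) AS score FROM posting")
                .append(" WHERE term IN (").append(placeholders).append(")")
                .append(" GROUP BY doc_id");
        if (mode == Mode.ALL) {
            // A document qualifies only if all distinct query terms occur in it.
            sql.append(" HAVING COUNT(DISTINCT term) = ").append(terms.size());
        }
        return sql.append(" ORDER BY score DESC").toString();
    }
}
```

The resulting string could be passed to JDBC's `Connection.prepareStatement`, with each term bound through `PreparedStatement.setString`, so that the database performs both the matching and the ranking.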

Both the indexing and the querying process accept pluggable language modules, which preprocess the plain-text data or the user queries. Currently we provide support for English preprocessing and Slovak lemmatization.
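
A language module could be modelled roughly as below. The interface `LanguageModule` and the toy `SimpleEnglishModule` (lower-casing plus a very short stop-word list) are hypothetical and only illustrate the plug-in idea; they do not reproduce the English preprocessing or the Slovak lemmatization shipped with the tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LanguageModuleSketch {
    /** Hypothetical plug-in contract, applied to both documents and queries. */
    interface LanguageModule {
        List<String> preprocess(String text);
    }

    /** Toy English module: lower-cases, splits on non-alphanumerics, drops a few stop words. */
    static final class SimpleEnglishModule implements LanguageModule {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "and", "of", "or", "the", "to"));

        @Override
        public List<String> preprocess(String text) {
            List<String> terms = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    terms.add(token);
                }
            }
            return terms;
        }
    }
}
```

Running the same module over documents at indexing time and over queries at search time keeps the terms on both sides comparable; a lemmatizing module would map each token to its base form in the same place.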

The two most important advantages of this tool are the ability to index large document collections (larger than main memory) and its running time, which in experiments was 20-50% better than that of known algorithms (Frieder-Grossman). The tool can be sped up further with distributed computing.

References

  1. Lencses, R.: Indexing for Information Retrieval System supported with Relational Database, Conference Sofsem 2005, Slovakia, January 2005, Proc. Vojtáš et al. (ed.): Sofsem 2005 Communications, Bratislava 2004, 81-90
  2. Lencses, R.: Dopytovanie v systéme zameranom na získavanie informácií s podporou relačnej databázy, Zborník Datakon 04, Brno, Česká republika, Masarykova univerzita, Brno, 2004, 271-280
  3. Krajči S., Novotný R.: Použitie lematizácie vo fulltextovom vyhľadávaní v slovenských dokumentoch. In: Babič, F., Paralič, J. (eds.): Proceedings of WIKT 2007 - 2nd Workshop on Intelligent and Knowledge Oriented Technologies, ISBN 978-80-89284-10-8, 147-152
  4. Lencses, R.: Fulltext indexing and querying with a support of relational database, Tools for Acquisition, Organisation and Presenting of Information and Knowledge, proceedings in Informatics and Information Technologies, P. Návrat, P. Bartoš, M. Bieliková, L. Hluchý, and P. Vojtáš (eds.), Vydavateľstvo STU, 2006, ISBN 80-227-2468-8