A              RFTS – Rich full-text search

A.1               Basic Information

RFTS is a document indexing and full-text search tool based on the Boolean model of information retrieval. It was developed within the NAZOU project. The motivation for implementing another search engine was to have an easily extensible and configurable document indexing tool for evaluating novel information retrieval methods, statistical analysis of documents, and lemmatization and stemming methods for the Slovak language.

A.1.1          Basic Terms

A.1.2          Method Description

RFTS uses the Boolean model of information retrieval (IR) systems. Detailed information about document content is stored in the index structure, including the positions of each word in the documents and the phrase number within the document. Words in the tool's dictionary are kept in their base form; morphological variants are transformed to the base form during processing. Different stemming algorithms are used for documents in different languages. The tool uses a relational database to store all information about the documents and their contents.
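The index layout described above can be illustrated with a minimal in-memory sketch. It is not the RFTS implementation: RFTS stores this information in a relational database and also records phrase numbers, which are omitted here. The class name PositionalIndexSketch and the baseForm placeholder (plain lower-casing instead of a real stemmer or lemmatizer) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** In-memory positional index: base form -> document id -> word positions. */
class PositionalIndexSketch {

    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Placeholder for stemming/lemmatization (assumption): lower-casing only. */
    static String baseForm(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    /** Tokenizes the text on non-letter characters and records each word position. */
    void indexDocument(String docId, String text) {
        int position = 0;
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            postings.computeIfAbsent(baseForm(token), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    /** Returns the positions of the word in the document, or an empty list. */
    List<Integer> positions(String word, String docId) {
        Map<String, List<Integer>> byDoc = postings.get(baseForm(word));
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return new ArrayList<>();
        }
        return byDoc.get(docId);
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.indexDocument("d1", "Search engines index documents. Engines search fast.");
        System.out.println(index.positions("search", "d1"));  // [0, 5]
        System.out.println(index.positions("ENGINES", "d1")); // [1, 4]
    }
}
```

Because both the indexed text and the looked-up word pass through the same baseForm step, morphological variants that map to one base form share a single dictionary entry.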

From an engineering point of view, it is worth mentioning that RFTS functionality, in conjunction with the Corporate Memory (also developed within the NAZOU project), can be accessed locally (through Java interfaces or command-line tools) as well as remotely through a Web Service interface. Remote access through the Web Service interface allows easy integration of the RFTS indexing and search solution into other components and rapid prototyping of new tools that require full-text search or some form of statistical analysis of a document collection.

A.1.3          Scenarios of Use

The tool is designed to index input text documents. The indexed collection of documents can be used for full-text search, for text mining operations, and for retrieving statistical information about the documents. It can be used by an end user (through the user interface) or by an application (through the API).

Within the NAZOU tool chain, RFTS is integrated with Ontea (a tool for automatic semantic annotation) to determine the relevance of the candidate instances of ontological concepts identified by Ontea.

 

A.1.4          External Links and Publications

Laclavik M., Ciglan M., Seleng M., Hluchy L.: Empowering Automatic Semantic Annotation in Grid, to appear in proceedings of PPAM 07, Springer-Verlag, accepted

Laclavík M., Ciglan M., Šeleng M., Krajči S., Vojtek P., Hluchý L.: Semi-automatic Semantic Annotation of Slovak Texts, to be published in proceedings of SLOVKO'07, Fourth International Seminar on NLP, Computational Lexicography and Terminology

Ciglan M., Laclavík M., Šeleng M., Hluchý L.: Document indexing for automatic semantic annotation support, Proceedings of 9th international conference on informatics - Informatics 07, ISBN 978-80-969243-7-0, 2006

Ciglan M.: Documents Content Indexing for Supporting Knowledge Acquisition Tools. In: Tools for Acquisition, Organisation and Presenting of Information and Knowledge. P.Navrat et al. (Eds.), Vydavatelstvo STU, Bratislava, 2006, pp.101-104, ISBN 80-227-2468-8. Workshop 26-28 September, Nizke Tatry, Slovakia.

A.2               Integration Manual

RFTS is developed in Java (Standard Edition 5) and distributed as a source code archive with an automated build script. The functionality of the tool is accessed through a Java interface.

A.2.1          Dependencies

Libraries:

·        NAZOU Corporate Memory library

·        NAZOU ITG – NAZOU integration technology package

Libraries for Stemmer/Lemmatizers:

·        NAZOU Tvaroslovnik

·        JWNL – an API for accessing WordNet-style relational dictionaries

·        SG-CDB – library for constant databases

A.2.2          Installation

The environment variable NAZOU_HOME has to be set and must point to a valid directory in the host file system. (On Linux-based systems, the variable can be set by executing the command 'export NAZOU_HOME=<path_to_installation_directory>'.)

Unzip RFTS distribution 'RFTS.zip'.

RFTS uses Apache Ant for building and deploying the software. Apache Ant has to be installed and configured on the host system.

Build and deploy RFTS by running the command 'ant deploy' in the RFTS distribution directory. The script compiles RFTS and deploys it to the $NAZOU_HOME directory.

In addition, the data files used by the stemming/lemmatization methods must be deployed in the NAZOU Corporate Memory, and the paths to those data files must be set in the RFTS configuration.

A.2.3          Configuration

The tool configuration file is located at

 $NAZOU_HOME/conf/RFTS/rfts.properties

The configuration file contains a list of properties specifying the utility used to obtain the base form of words for each language. For example, in the following configuration the Tvaroslovnik tool is used to obtain the base form of Slovak words, using the configuration files in $NAZOU_HOME/RFTS/tvaroslovnik/:

    stemmer.sk=nazou.dm.fullTextPo...exStorage.wordsBaseForm.Tvaroslovnik

    stemmer.sk.conf=/RFTS/tvaroslovnik/

An additional obligatory configuration property is rdb.resource, which specifies the CM database resource used by the tool; the default value is:

    rdb.resource=RFTS
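The properties above follow the standard Java properties format, so an integrating application could inspect them with java.util.Properties. The following is a minimal sketch rather than the way RFTS itself reads its configuration; the class RftsConfigSketch, its helper methods, and the stemmer class name example.HypotheticalStemmer are assumptions introduced for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class RftsConfigSketch {

    /** Loads properties from any reader, for example a FileReader opened on rfts.properties. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the configured stemmer class name for a language code, or null if none is set. */
    static String stemmerClass(Properties props, String lang) {
        return props.getProperty("stemmer." + lang);
    }

    /** Returns the obligatory rdb.resource value and fails fast when it is missing. */
    static String rdbResource(Properties props) {
        String resource = props.getProperty("rdb.resource");
        if (resource == null || resource.trim().isEmpty()) {
            throw new IllegalStateException("rdb.resource is not configured");
        }
        return resource.trim();
    }

    public static void main(String[] args) {
        // Hypothetical class name; the real value comes from the deployed configuration.
        String sample = "stemmer.sk=example.HypotheticalStemmer\n"
                + "stemmer.sk.conf=/RFTS/tvaroslovnik/\n"
                + "rdb.resource=RFTS\n";
        Properties props = load(new StringReader(sample));
        System.out.println(stemmerClass(props, "sk"));
        System.out.println(props.getProperty("stemmer.sk.conf"));
        System.out.println(rdbResource(props));
    }
}
```

Checking rdb.resource before the tool starts makes a missing obligatory property visible immediately instead of at the first database access.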

A.2.4          Integration Guide

This section describes how to integrate RFTS with applications through the Java API.

Indexing:

Class nazou.dm.fullTextPositionsIndexStorage.PositionFTIndexer provides the following methods for document indexing:

    indexDocument(String inFile, String lang)

·        indexes the input document (from the CM file storage) using the settings for the given language

    indexDocuments()

·        indexes all documents prepared by the DocConverter tool

 

Full-text search:

Class nazou.ds.fullTextSearch.fullTextSearch provides methods for full-text search. The following types of queries are supported in the current implementation:

·        retrieve documents containing all input words (AND operator)

·        retrieve documents containing at least one of the input words (OR operator)

·        retrieve documents containing input words at most k positions apart from each other (relative position)
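The three query types listed above can be sketched over a small in-memory positional index. This is an illustration of the query semantics only, not the RFTS API: the class BooleanQuerySketch and its methods are hypothetical, the proximity query is limited to two words for brevity, and RFTS itself evaluates queries against its relational database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BooleanQuerySketch {

    // word -> document id -> positions of the word in that document
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    void add(String word, String docId, Integer... positions) {
        postings.computeIfAbsent(word, w -> new HashMap<>()).put(docId, Arrays.asList(positions));
    }

    private Set<String> docsFor(String word) {
        Set<String> docs = new TreeSet<>();
        Map<String, List<Integer>> byDoc = postings.get(word);
        if (byDoc != null) {
            docs.addAll(byDoc.keySet());
        }
        return docs;
    }

    /** AND: documents containing every input word. */
    Set<String> and(String... words) {
        Set<String> result = new TreeSet<>();
        for (int n = 0; n < words.length; n++) {
            if (n == 0) {
                result.addAll(docsFor(words[n]));
            } else {
                result.retainAll(docsFor(words[n]));
            }
        }
        return result;
    }

    /** OR: documents containing at least one input word. */
    Set<String> or(String... words) {
        Set<String> result = new TreeSet<>();
        for (String word : words) {
            result.addAll(docsFor(word));
        }
        return result;
    }

    /** Relative position: documents where the two words occur at most k positions apart. */
    Set<String> near(String first, String second, int k) {
        Set<String> result = new TreeSet<>();
        for (String docId : and(first, second)) {
            List<Integer> a = postings.get(first).get(docId);
            List<Integer> b = postings.get(second).get(docId);
            if (anyWithin(a, b, k)) {
                result.add(docId);
            }
        }
        return result;
    }

    private static boolean anyWithin(List<Integer> a, List<Integer> b, int k) {
        for (int pa : a) {
            for (int pb : b) {
                if (Math.abs(pa - pb) <= k) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BooleanQuerySketch index = new BooleanQuerySketch();
        index.add("full", "d1", 0);
        index.add("full", "d2", 3);
        index.add("text", "d1", 1);
        index.add("text", "d2", 9);
        index.add("text", "d3", 0);
        index.add("search", "d1", 2);
        index.add("search", "d3", 4);
        System.out.println(index.and("full", "text"));     // [d1, d2]
        System.out.println(index.or("full", "search"));    // [d1, d2, d3]
        System.out.println(index.near("full", "text", 2)); // [d1]
    }
}
```

The proximity query is only possible because word positions are stored in the index; a plain word-to-document index could answer the AND and OR queries but not the third type.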

Class nazou.ds.fullTextSearch.nazou.ds.fullTextSearch provides the implementation of the statistical analysis required by the Ontea tool.

Based on regular expressions, Ontea identifies the part of a text related to a semantic context and matches the subsequent sequence of characters to create an instance of the concept. Let us denote the sequence of words related to the semantic context by C and the word sequence identified as a candidate instance by I. We evaluate the relevance of the new instance by computing the ratio of the number of close occurrences of C and I to the number of occurrences of I:

close_occurrence(C, I) / occurrence(I)

The RFTS indexing tool provides enough functionality to retrieve the required statistical values, computed over the whole collection of documents stored in the RFTS index structures.

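Let COLL be a collection of documents $d_1, \ldots, d_n$:

$$\mathrm{COLL} = \{d_1, \ldots, d_n\}$$

Let $d \in \mathrm{COLL}$, let $\mathit{distance}$ be a positive integer, and let $w_1, \ldots, w_k$ be words of a natural language. The function $\mathrm{dist}(d, \mathit{distance}, w_1, \ldots, w_k)$, where $d \in \mathrm{COLL}$, denotes the number of distinct word sequences of length $\mathit{distance}$ in $d$ that contain the words $w_1, \ldots, w_k$.

We compute the relevance of a candidate instance as:

$$\mathrm{relevance}(C, I) = \frac{\sum_{d \in \mathrm{COLL}} \mathrm{dist}(d, \mathit{distance}, C, I)}{\sum_{d \in \mathrm{COLL}} \mathrm{dist}(d, |I|, I)}$$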

If the resulting relevance value exceeds a defined threshold, the candidate word sequence I is considered to be a valid instance of the semantic concept related to the sequence C. For the experimental evaluation of the approach, the threshold was set manually after inspecting the preliminary relevance values of the generated candidate instances.
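The ratio close_occurrence(C, I) / occurrence(I) and the threshold test can be sketched on a toy collection. This is a simplified illustration under stated assumptions, not the RFTS implementation: documents are plain token arrays instead of database-backed index structures, I and C are matched as contiguous word sequences, an occurrence of I counts as "close" when it and some occurrence of C fit into one window of a given number of words, and the class RelevanceSketch, the window size 4, and the threshold 0.4 are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RelevanceSketch {

    /** Returns the start positions at which seq occurs contiguously in doc. */
    static List<Integer> occurrences(String[] doc, String[] seq) {
        List<Integer> starts = new ArrayList<>();
        if (seq.length == 0) {
            return starts;
        }
        for (int s = 0; s + seq.length <= doc.length; s++) {
            if (Arrays.equals(Arrays.copyOfRange(doc, s, s + seq.length), seq)) {
                starts.add(s);
            }
        }
        return starts;
    }

    /** True when both sequences fit into one window of the given number of words. */
    static boolean fitsWindow(int cStart, int cLen, int iStart, int iLen, int window) {
        int spanStart = Math.min(cStart, iStart);
        int spanEnd = Math.max(cStart + cLen, iStart + iLen);
        return spanEnd - spanStart <= window;
    }

    /** close_occurrence(C, I) / occurrence(I), summed over the whole collection. */
    static double relevance(List<String[]> collection, String[] context, String[] instance, int window) {
        int occurrence = 0;
        int closeOccurrence = 0;
        for (String[] doc : collection) {
            List<Integer> cStarts = occurrences(doc, context);
            for (int iStart : occurrences(doc, instance)) {
                occurrence++;
                for (int cStart : cStarts) {
                    if (fitsWindow(cStart, context.length, iStart, instance.length, window)) {
                        closeOccurrence++;
                        break; // count each occurrence of I at most once
                    }
                }
            }
        }
        return occurrence == 0 ? 0.0 : (double) closeOccurrence / occurrence;
    }

    public static void main(String[] args) {
        List<String[]> collection = Arrays.asList(
                "the company Acme Corp hires engineers".split(" "),
                "yesterday Acme Corp reported profits".split(" "),
                "a company called Acme Corp grew".split(" "),
                "Acme Corp is now a large company".split(" "));
        String[] context = {"company"};
        String[] instance = {"Acme", "Corp"};
        double relevance = relevance(collection, context, instance, 4);
        double threshold = 0.4; // hypothetical value, chosen by hand as in the text
        System.out.println(relevance);             // 0.5
        System.out.println(relevance > threshold); // true
    }
}
```

In this toy collection the candidate occurs four times and twice close to the context word, so the relevance is 0.5 and the candidate passes the assumed threshold.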

A.3               Development Manual

This section describes the internal structure of the RFTS tool and the method implementations.

A.3.1          Tool Structure

The tool is structured into the following sub-packages:

nazou.dm.fullTextPositionsIndexStorage – contains methods for indexing a collection of documents. It is integrated with the NAZOU Corporate Memory (CM), from which it obtains the information about which documents in the CM file storage need to be indexed. This information is prepared by the DocConverter tool, which indicates new and modified documents in the system.

nazou.dm.fullTextPositionsIndexStorage.wordsBaseForm – contains methods for transforming natural-language words to their base form, using either stemming or lemmatization. It defines the interface StemmerLemmatizer; every stemming/lemmatization method must implement this interface to be interoperable with RFTS. Class SLHandler configures the word transformation implementations when RFTS starts and maintains the instances of the word transformation classes.

nazou.ds.fullTextSearch.fullTextSearch – contains methods for querying and analysing the indexed collection of documents.

A.3.2          Method Implementation

Document indexing:

Class PositionFTIndexer from package nazou.dm.fullTextPositionsIndexStorage implements the document indexing. It comprises functionality for document preprocessing (tokenization and word transformations) and for storing the data in the database structure, where the data is indexed using standard database indexes.

Word transformation to base form:

Interface StemmerLemmatizer defines the methods that must be implemented by wrappers of the stemming/lemmatization methods. Class SLHandler configures the word transformation implementations when RFTS starts and maintains the instances of the word transformation classes.
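The division of work between a common interface and a handler that keeps one implementation per language can be sketched as follows. The method names of StemmerLemmatizer and SLHandler are not given in this manual, so the nested types WordTransformer, ToySuffixStemmer, and Registry below are hypothetical stand-ins, and the toy stemmer (lower-casing and dropping a trailing "s") is not a real stemming algorithm.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordTransformSketch {

    /** Hypothetical stand-in for the StemmerLemmatizer contract. */
    interface WordTransformer {
        String toBaseForm(String word);
    }

    /** Toy implementation: lower-cases and strips a trailing "s". Not a real stemmer. */
    static class ToySuffixStemmer implements WordTransformer {
        public String toBaseForm(String word) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (lower.length() > 3 && lower.endsWith("s")) {
                return lower.substring(0, lower.length() - 1);
            }
            return lower;
        }
    }

    /** Hypothetical stand-in for SLHandler: one transformer instance per language code. */
    static class Registry {
        private final Map<String, WordTransformer> byLanguage = new HashMap<>();

        /** Instantiates the class by name, as a properties-driven setup would. */
        void register(String lang, String className) {
            try {
                Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
                byLanguage.put(lang, (WordTransformer) instance);
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalStateException("Cannot load transformer " + className, e);
            }
        }

        /** Falls back to the unchanged word when no transformer is configured for the language. */
        String baseForm(String lang, String word) {
            WordTransformer transformer = byLanguage.get(lang);
            return transformer == null ? word : transformer.toBaseForm(word);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("en", ToySuffixStemmer.class.getName());
        System.out.println(registry.baseForm("en", "Documents")); // document
        System.out.println(registry.baseForm("sk", "dokumenty")); // dokumenty
    }
}
```

Loading implementations by class name keeps the indexer independent of any particular stemmer, which matches the way the configuration file maps a language code to a class.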

Querying and analysis:

Package nazou.ds.fullTextSearch.fullTextSearch contains methods for querying and analysing the indexed collection of documents. Each query/analysis type is implemented as a separate class to hide the complexity of the SQL queries and data processing from the application being integrated with RFTS. Implementations of query/analysis methods should be compliant with the nazou.cm.client.db.templateQueries.RdbmsTemplateQuery interface.

Class fullTextSearch provides a wrapper that integrates the implementations of the query and data analysis methods, providing a single point of interaction with the RFTS query system.
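The idea of one class per query type, each hiding its SQL behind a common contract, can be sketched as follows. Everything here is an assumption made for illustration: the methods of RdbmsTemplateQuery are not documented in this manual, so the TemplateQuery interface is a hypothetical substitute, and the table word_positions with columns doc_id and word is an invented schema, not the RFTS database layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

class QueryTemplateSketch {

    /** Hypothetical contract: a query object exposes parameterized SQL plus its bind values. */
    interface TemplateQuery {
        String sql();
        List<String> parameters();
    }

    /** AND query: documents that contain every one of the given base forms. */
    static final class AllWordsQuery implements TemplateQuery {
        private final List<String> words;

        AllWordsQuery(String... words) {
            if (words.length == 0) {
                throw new IllegalArgumentException("at least one word is required");
            }
            // Duplicates are dropped so that the HAVING count below stays correct.
            this.words = new ArrayList<>(new LinkedHashSet<>(Arrays.asList(words)));
        }

        public String sql() {
            String placeholders = String.join(", ", Collections.nCopies(words.size(), "?"));
            return "SELECT doc_id FROM word_positions WHERE word IN (" + placeholders + ")"
                    + " GROUP BY doc_id HAVING COUNT(DISTINCT word) = " + words.size();
        }

        public List<String> parameters() {
            return Collections.unmodifiableList(words);
        }
    }

    public static void main(String[] args) {
        TemplateQuery query = new AllWordsQuery("full", "text", "full");
        System.out.println(query.sql());
        System.out.println(query.parameters()); // [full, text]
    }
}
```

An integrating application would see only the query object and its result; the SQL text and the bind values stay inside the query class, so a change of schema affects that class alone.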

A.4               Manual for Adaptation to Other Domains

RFTS is a generic tool; no customization or reconfiguration is required to adapt it to other domains.