Easily extensible tool for document content indexing and rich full-text search.
|Institution:||Slovak Academy of Sciences|
|Technologies used:||Java, MySQL, OGSA-DAI|
|Inputs:||Documents in plain text format|
|Outputs:||Document indexes, full-text search query results|
|Documentation:||HTML, doc, JavaDoc|
This tool addresses the problem of fast, content-based identification of specific documents in a large document collection. Documents in text format are indexed, which makes fast full-text search over the indexed collection possible. The motivation for implementing another search engine was to have an easily extensible and configurable document indexing tool for evaluating novel methods of information retrieval, statistical analysis of documents, and lemmatization and stemming for the Slovak language.
The tool consists of two logically separate parts, document indexing and full-text search:
- document indexing - builds indexes of the input documents, which are stored in a MySQL database. The index structure holds detailed information about document content, including the position of each word in the document and the phrase number within the document. Words in the tool's dictionary are kept in their base form; different stemming algorithms are used for documents in different languages.
- full-text search - provides full-text search over a collection of documents. The word positions kept in the index structure make it possible to execute search queries with constraints on the relative distance between words. The following search conditions are provided: AND condition, OR condition, and relative word distance.
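The two parts can be illustrated with a minimal in-memory sketch of a positional index. This is not the RFTS implementation: the class `PositionalIndexSketch` and its methods are hypothetical names chosen for illustration, the index lives in a `HashMap` rather than in MySQL, lowercasing stands in for language-specific stemming, and phrase numbers are omitted. It only shows how stored word positions make AND, OR, and relative word distance conditions possible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory positional index; the real tool keeps its index in MySQL. */
class PositionalIndexSketch {
    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Assumption: lowercasing is a placeholder for real stemming to the base form. */
    static String normalize(String word) {
        return word.toLowerCase();
    }

    /** Splits the text on non-letter characters and records each word position. */
    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            postings.computeIfAbsent(normalize(token), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    private Map<Integer, List<Integer>> lookup(String term) {
        Map<Integer, List<Integer>> found = postings.get(normalize(term));
        return found != null ? found : Collections.<Integer, List<Integer>>emptyMap();
    }

    /** AND condition: documents that contain both terms. */
    Set<Integer> and(String a, String b) {
        Set<Integer> result = new TreeSet<>(lookup(a).keySet());
        result.retainAll(lookup(b).keySet());
        return result;
    }

    /** OR condition: documents that contain at least one of the terms. */
    Set<Integer> or(String a, String b) {
        Set<Integer> result = new TreeSet<>(lookup(a).keySet());
        result.addAll(lookup(b).keySet());
        return result;
    }

    /** Relative word distance: both terms occur at most maxDistance positions apart. */
    Set<Integer> within(String a, String b, int maxDistance) {
        Set<Integer> result = new TreeSet<>();
        for (int docId : and(a, b)) {
            if (closeEnough(lookup(a).get(docId), lookup(b).get(docId), maxDistance)) {
                result.add(docId);
            }
        }
        return result;
    }

    private static boolean closeEnough(List<Integer> xs, List<Integer> ys, int maxDistance) {
        for (int x : xs) {
            for (int y : ys) {
                if (Math.abs(x - y) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument(1, "fast full text search");
        index.addDocument(2, "text indexing comes long before any search");
        index.addDocument(3, "document indexing only");

        System.out.println(index.and("text", "search"));       // [1, 2]
        System.out.println(index.or("text", "indexing"));      // [1, 2, 3]
        System.out.println(index.within("text", "search", 1)); // [1]
    }
}
```

In document 2 the words "text" and "search" are six positions apart, so the distance query with a limit of 1 excludes it even though the AND query matches. A database-backed index would answer the same query by comparing stored positions instead of scanning lists in memory.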
RFTS functionality, in conjunction with Corporate Memory, can be accessed locally (using Java interfaces or command-line tools) as well as remotely, using RPC calls or a Web Service interface. The remote access and the Web Service interface allow easy integration of the RFTS indexing and search solution into other components and allow rapid prototyping of new tools that require full-text search or some form of statistical analysis of a document collection.
- CIGLAN, M. - LACLAVIK, Michal - SELENG, Martin - HLUCHY, Ladislav: Document indexing for automatic semantic annotation support. In INFORMATICS'2007 : proceedings of the ninth international conference on informatics. Bratislava : Slovak Society for Applied Cybernetics and Informatics, 2007. ISBN 978-80-969243-7-0, pp. 163-169.
- Ciglan M.: Documents Content Indexing for Supporting Knowledge Acquisition Tools, In: Tools for Acquisition, Organisation and Presenting of Information and Knowledge. P.Navrat et al. (Eds.), Vydavatelstvo STU, Bratislava, 2006, pp.49-63, ISBN 80-227-2468-8. Workshop 29-30 September, Nizke Tatry, Slovakia. ITAT 2006, NAZOU Workshop, 26. 9 - 1. 10. 2006, Chata Kosodrevina, Bystrá dolina, Nízke Tatry, 2006