A.1 Basic Information
In full-text information retrieval, documents and words are stored in the so-called vector (or object-attribute) model, i.e. a rectangular table in which each cell stores (binary or more complex) information about the presence of the word represented by the column in the document represented by the row. In this process it is important and useful to group the different forms of the same word together and to pick a suitable representative of the group, which can be the stem of the word or its lemma.
The two languages we are interested in, Slovak and English, differ fundamentally from this point of view. English uses inflection only in a very simple way and hence has very few forms of each word. The rules for deriving word forms from the common base form (usually the stem) are well defined, and the number of exceptions (e.g. irregular verbs) is relatively small. The usual method for this grouping is stemming, as implemented in the well-known Porter algorithm, which is based on removing suffixes (such as “-ed”, “-ing”, “-s”). Slovak, on the other hand (like all Slavic languages), belongs to the so-called inflected languages, in which a large share of words have many (dozens of) forms that are created in hundreds of ways. Porter’s algorithm is not suitable in this case, so a replacement is needed. Our tool Morphonary is one possible answer: it provides lemmatization, i.e. it finds the lemma of a given word.
A.1.1 Basic Terms
Word form: A word in a declined form.
Lemma: The basic form of a word (usually the nominative singular for a noun, the infinitive for a verb).
Lemmatization: The process of finding the lemma of a given word.
Ending: The identical part of two words, measured from their ends leftwards. (It is not the same as the linguistic term suffix.)
Guide: A word form from the database (together with its lemma) that helps to find the lemma of a given word.
A.1.2 Method Description
The main idea behind our lemmatization of Slovak words is the observation that words with the same ending usually decline in the same way, so the forms of one word can be derived from the already known forms of another. From this point of view, the crucial part of lemmatization is finding a suitable guide for the given word. The common ending of the word and its guide is then replaced by the corresponding ending of the guide's lemma, and the lemma of the given word is obtained by concatenation.
Note that the notion of a guide is not identical to the classical “school” patterns of Slovak declension such as “chlap”, “hrdina”, “dub”, “stroj”, and so on. Any (declinable) Slovak word can be a guide once someone has added it to the list. The only important thing is its ending. Moreover, we do not take the morphological structure of the word into account; this structure is in fact not essential to the declension process.
If the given word is stored in the database directly as a lemma, its lemmatization is instantaneous, because the database is cached in memory in a suitable form. Otherwise a guide has to be found, and the lemmatization of the word takes about 5 ms. Morphonary is therefore able to work in real time.
A.1.3 Scenarios of Use
Morphonary can be used in the following scenarios:
§ Single usage: The lemmatization of a single Slovak word form.
§ Bag usage: The lemmatization of all words of a given Slovak text document.
Morphonary should not be used in the following cases:
§ If the source word or document is not in Slovak.
§ If the user knows that all (or almost all) words in the document are already in their base form. Using Morphonary is then counterproductive.
A.2 Integration Manual
This section describes how to integrate Morphonary into another application. Morphonary is developed in Java SE 5, and its classes and methods can be imported from a Java archive.
The full version of Morphonary uses the following Java archives, which must be included in the build path:
§ commons-configuration-1.3.jar – generic configuration interface which enables an application to read configuration data
§ Nazou-ITG-2.1.jar – integration technology; only class NazouConfiguration is used
§ log4j-1.2.12.jar – logging utility
§ spring.jar – only JDBC template is used
§ mysql-connector-java-5.0.3-bin.jar – MySQL connector
However, Morphonary is fully functional in its light version, and we suggest integrating it in this simplified form. In this case it uses no Java archives, only three (serialized) data files:
§ SCS.dat – data from Slovník cudzích slov (Dictionary of Foreign Words)
§ SSJ.dat – data from Slovník slovenského jazyka (Dictionary of the Slovak Language)
§ declined.dat – the list of possible guides
Installation of the (recommended) light version of Morphonary is simple and has no special requirements. If root is the file system root of the application, the source files should be placed in the folder root/src/tvaroslovnik and the data files in the folder root/data.
The list of source files:
Morphonary does not need any special configuration. The folder names are described in the section Installation.
A.2.4 User Guide
Morphonary is a command utility with no graphical user interface. It can be used in the following scenarios:
§ Single usage: Call the method lemmatize of the class SlovakLemmatizer. Its single input is the given word; its output is the lemma of that word.
§ Bag usage: Call the method lemmatizeFile of the class SlovakLemmatizer. It has (formally) two inputs, the names of the input and output files.
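The following is only a rough sketch of how bag usage could be layered on top of single usage; it is not the code of SlovakLemmatizer. The class BagUsageSketch, the method lemmatizeLines, the extra parameter lemmatize (a stand-in for the single-word method, whose exact signature is not given in this text), whitespace tokenization, UTF-8 files and the rule that unresolved words are copied unchanged are all assumptions made for illustration. It uses Java 8 features for brevity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of bag usage; not the original SlovakLemmatizer. */
final class BagUsageSketch {

    /** Replaces every whitespace-separated token by its lemma; unresolved tokens are kept. */
    static List<String> lemmatizeLines(List<String> lines, Function<String, String> lemmatize) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            StringBuilder sb = new StringBuilder();
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // blank line
                }
                String lemma = lemmatize.apply(token);
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(lemma != null ? lemma : token);
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Reads the input file, lemmatizes it line by line and writes the output file. */
    static void lemmatizeFile(String inputName, String outputName,
                              Function<String, String> lemmatize) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(inputName), StandardCharsets.UTF_8);
        Files.write(Paths.get(outputName), lemmatizeLines(lines, lemmatize), StandardCharsets.UTF_8);
    }
}
```

In a real integration the third argument would simply delegate to the single-word method of the lemmatizer instance.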
A.3.1 Tool Structure
Morphonary has a simple architecture, in keeping with its intended simplicity: the whole tool is contained in a single package.
A.3.2 Method Implementation
Both provided lemmatization methods, lemmatize and lemmatizeFile, are contained in the file SlovakLemmatizer.java. The first implements the following algorithm; the second is derived from it.
Let X be the given word.
1. Check the presence of X in the list of lemmas; if “yes” then return X itself and stop.
2. Check the presence of X in the list of guides; if “yes” then return the lemma of guide X and stop.
3. Repeat the following:
a. Look for a guide Y whose ending is the same as the ending of X (with the common ending as long as possible). If the length of the common ending is 0, then return null and stop.
b. Derive the (probable) lemma X' of X by comparison with the (known) lemma of Y (details below).
c. Check the presence of X' in the list of lemmas; if “yes” then return X' and stop.
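Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of SlovakLemmatizer: the class LemmatizerSketch, the plain Set and Map containers and the helper commonEnding are hypothetical, and "repeat" is interpreted as trying the remaining guides in order of decreasing common ending. Sorting all guides on every call is a simplification chosen for readability.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the lemmatization steps; not the original implementation. */
final class LemmatizerSketch {
    private final Set<String> lemmas;          // known lemmas
    private final Map<String, String> guides;  // guide word form -> lemma of that guide

    LemmatizerSketch(Set<String> lemmas, Map<String, String> guides) {
        this.lemmas = lemmas;
        this.guides = guides;
    }

    /** Number of trailing characters that a and b have in common. */
    static int commonEnding(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length()
                && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    String lemmatize(String x) {
        if (lemmas.contains(x)) {
            return x;                                   // step 1
        }
        String known = guides.get(x);
        if (known != null) {
            return known;                               // step 2
        }
        // Step 3: try guides from the longest common ending to the shortest.
        List<String> candidates = new ArrayList<>(guides.keySet());
        candidates.sort(Comparator.comparingInt((String g) -> commonEnding(x, g)).reversed());
        for (String y : candidates) {
            int k = commonEnding(x, y);
            if (k == 0) {
                return null;                            // step 3a: no common ending left
            }
            String yLemma = guides.get(y);
            String yStem = y.substring(0, y.length() - k);
            if (!yLemma.startsWith(yStem)) {
                continue;                               // guide cannot be used for the exchange
            }
            // Step 3b: exchange the common ending for the ending of the guide's lemma.
            String candidate = x.substring(0, x.length() - k) + yLemma.substring(yStem.length());
            if (lemmas.contains(candidate)) {
                return candidate;                       // step 3c
            }
        }
        return null;
    }
}
```

A null result means that no guide led to a candidate that is present in the list of lemmas.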
The derivation of the probable lemma in step 3b, i.e. our “exchanging algorithm”, can be illustrated by the following example. Let us assume that our task is to find the lemma of the word form “ponúk” and that this word form is not present in the list of guides. By contrast, the word “ruka” is present there together with all its forms. One of these forms, the genitive plural “rúk”, shares the ending “úk” with “ponúk”. Replacing this ending with “uka”, the corresponding ending of the lemma “ruka”, yields the probable lemma “ponuka”, which is then checked in step 3c.
Of course, the word “oblúk” can also serve as a guide for “ponúk” (if it is in the list of guides). However, it leads to the probable lemma “ponúk” (because the lemma of “oblúk” is again “oblúk”), and the check in step 3c fails.
This “exchanging algorithm” is implemented in the method deriveTheLemma of the class Word.
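The exchange itself can be sketched in a few lines. The code below is not the original deriveTheLemma method; the class ExchangeSketch and its method names and signatures are hypothetical and work on plain strings instead of Word objects. It assumes that the guide's lemma begins with the part of the guide form that precedes the common ending, and gives up (returns null) otherwise.

```java
/** Hypothetical stand-in for the exchanging step; not the original Word.deriveTheLemma. */
final class ExchangeSketch {

    /** Length of the longest common ending (shared trailing characters) of a and b. */
    static int commonEndingLength(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length()
                && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    /**
     * Derives a probable lemma of word from a guide form and the guide's lemma.
     * Returns null if there is no common ending or the exchange is not possible.
     */
    static String deriveLemma(String word, String guideForm, String guideLemma) {
        int k = commonEndingLength(word, guideForm);
        if (k == 0) {
            return null;
        }
        String guideStem = guideForm.substring(0, guideForm.length() - k);
        if (!guideLemma.startsWith(guideStem)) {
            return null;
        }
        String lemmaEnding = guideLemma.substring(guideStem.length());
        return word.substring(0, word.length() - k) + lemmaEnding;
    }

    public static void main(String[] args) {
        // Guide "rúk" (lemma "ruka"): common ending "úk" is exchanged for "uka".
        System.out.println(deriveLemma("ponúk", "rúk", "ruka"));     // ponuka
        // Guide "oblúk" (lemma "oblúk"): the ending is exchanged for itself.
        System.out.println(deriveLemma("ponúk", "oblúk", "oblúk"));  // ponúk
    }
}
```

The second call shows why the result is only a probable lemma: it still has to be confirmed against the list of lemmas.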
In accordance with this method, a suitable data structure was chosen, namely TreeSet<Word> in the class WordTree. This class has three instances, one for each data file. The words in it are ordered by reverse word ordering (implemented in the class ReverseWordComparator), which makes it easy, for example, to determine whether a given word is a lemma.
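To illustrate why such an ordering is convenient, the sketch below builds a TreeSet of plain strings ordered by their reversed spelling. It is an assumption-laden illustration only: the class ReverseOrderSketch and the comparator BY_REVERSED are hypothetical stand-ins for WordTree and ReverseWordComparator, whose actual code is not shown in this text. In a set ordered this way, words with the same ending lie next to each other, so the stored words sharing the longest ending with a query are its immediate neighbours.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical illustration of reverse word ordering; not the original WordTree. */
final class ReverseOrderSketch {

    /** Compares words by their characters read from the end towards the beginning. */
    static final Comparator<String> BY_REVERSED =
            Comparator.comparing((String s) -> new StringBuilder(s).reverse().toString());

    static TreeSet<String> treeOf(String... words) {
        TreeSet<String> tree = new TreeSet<>(BY_REVERSED);
        for (String w : words) {
            tree.add(w);
        }
        return tree;
    }

    public static void main(String[] args) {
        TreeSet<String> tree = treeOf("ruka", "rúk", "oblúk", "ruky", "dub");
        System.out.println(tree);                   // [ruka, dub, oblúk, rúk, ruky]
        System.out.println(tree.contains("ruka"));  // membership test in O(log n)
        // Neighbours of a word that is not stored: both end in "úk".
        System.out.println(tree.floor("ponúk") + " " + tree.ceiling("ponúk"));  // oblúk rúk
    }
}
```

Looking at floor and ceiling is therefore enough to find a first guide candidate without scanning the whole set.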
A.3.3 Enhancements and Optimizing
The success rate of determining the true lemma of a Slovak word form is about 90%. Given the ambiguity of natural language, this number can hardly reach 100%. Nevertheless, it is worth considering ways to raise it. One qualitative improvement would be to add information about the word category (including gender in the case of nouns) to the class Word, which describes a word form. An unsurprising quantitative improvement is to keep adding new guides to the database. A possibly achievable (but time-consuming) ideal is to have all forms of all (codified) Slovak words.
Another possible enhancement is to allow Slovak diacritical marks to be ignored, because certain types of documents (especially in electronic form) do not use them.
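One way this enhancement might be approached is to compare words after their diacritical marks have been removed. The sketch below is only an illustration of that idea using the standard java.text.Normalizer class; the class DiacriticsSketch and the method stripDiacritics are hypothetical and are not part of Morphonary.

```java
import java.text.Normalizer;

/** Hypothetical helper for diacritics-insensitive comparison; not part of Morphonary. */
final class DiacriticsSketch {

    /** Decomposes accented letters (NFD) and removes the combining marks. */
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("ponúk"));  // ponuk
        System.out.println(stripDiacritics("ľúbiť"));  // lubit
    }
}
```

Stripped forms could serve as an additional lookup key, at the cost of merging word forms that differ only in their diacritics.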
A.4 Manual for Adaptation to Other Domains
Morphonary is a linguistic utility that does not depend on a domain in the sense of the NAZOU project. However, if a language is understood as a domain, the question becomes meaningful.
A.4.1 Configuring to Other Domain
This tool is clearly not suitable for languages with little inflection such as English, but it can be applied to other languages in which declension amounts exactly to changing the ending of the declined word, e.g. other Slavic languages. The algorithm then remains the same; the data files (and probably their names) must of course be replaced.
As stated above, Morphonary is independent of a domain in the sense of the NAZOU project.