Method for Lemmatization of a Slovak Word Form (tool Morphonary)

The method finds the lemma (base form) of a given Slovak word form.

Institution: Šafárik University, Košice
Technologies used: Java, MySQL
Inputs: Word form
Outputs: Lemma of a given word form.
Documentation: HTML, doc, JavaDoc

Addressed Problems

In full-text information retrieval, documents and words are stored in the so-called vector (or object-attribute) model, i.e. a rectangular table in which each row represents a document, each column represents a word, and each cell stores information (binary or more complex) about the presence of that word in that document. In this setting it is very useful to group the different forms of the same word together and to pick a suitable representative for each group; the lemma is a very good candidate.
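To illustrate why grouping word forms under a lemma helps, the following minimal Java sketch builds a binary document-by-lemma table in sparse form (each row holds only the attributes that are present). The class name `LemmaVectorModel`, the method `index` and the `lemmaOf` lookup map are assumptions made for this illustration only; they are not part of Morphonary.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a binary object-attribute model keyed by lemmas. */
final class LemmaVectorModel {

    /**
     * Builds one row per document. Each row is the set of lemmas present in
     * that document, i.e. the columns whose cell would be 1 in a binary table.
     * Word forms without a known lemma are kept as they are (an assumption).
     */
    static Map<String, Set<String>> index(Map<String, List<String>> documents,
                                          Map<String, String> lemmaOf) {
        Map<String, Set<String>> rows = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> doc : documents.entrySet()) {
            Set<String> present = new TreeSet<>();
            for (String wordForm : doc.getValue()) {
                // Different forms of the same word collapse into one column.
                present.add(lemmaOf.getOrDefault(wordForm, wordForm));
            }
            rows.put(doc.getKey(), present);
        }
        return rows;
    }
}
```

With such a table, a document containing "mamy" and a document containing "mama" share the same column, so a query for either form can match both.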

Description

The Morphonary lemmatization of a given word form can be expressed by the following pseudo-code. X is the input word form; the algorithm returns its lemma (or at least an approximation of it). The algorithm uses two lists: a list of lemmas and a list of guides (word forms whose lemma is known).

Pseudo-code:
  1. check the presence of X in the list of lemmas;
    if "yes" then return X itself and stop
  2. check the presence of X in the list of guides;
    if "yes" then return the lemma of guide X and stop
  3. repeat the following:
    1. look for a guide Y that shares an ending with X (an ending as long as possible)
    2. derive the probable lemma X' of X by comparing X with the (known) lemma of Y
    3. check the presence of X' in the list of lemmas;
      if "yes" then return X' and stop
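The steps above can be sketched in Java as follows. This is only a minimal illustration under stated assumptions, not the Morphonary implementation: the class `LemmatizerSketch`, its in-memory `lemmas` set and `guides` map are hypothetical. The sketch also assumes that the lemma of a guide is obtained by replacing the part after the guide's stem, and that the loop in step 3 tries common endings from the longest to the shortest.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of the three-step lookup; not the Morphonary code. */
final class LemmatizerSketch {
    private final Set<String> lemmas;         // list of known lemmas
    private final Map<String, String> guides; // guide word form -> its known lemma

    LemmatizerSketch(Set<String> lemmas, Map<String, String> guides) {
        this.lemmas = lemmas;
        this.guides = guides;
    }

    Optional<String> lemmatize(String x) {
        // Step 1: X is itself a lemma.
        if (lemmas.contains(x)) {
            return Optional.of(x);
        }
        // Step 2: X is a guide, so its lemma is known.
        String known = guides.get(x);
        if (known != null) {
            return Optional.of(known);
        }
        // Step 3: try common endings, longest first (assumed iteration order).
        for (int len = x.length() - 1; len >= 1; len--) {
            String ending = x.substring(x.length() - len);
            String stemX = x.substring(0, x.length() - len);
            for (Map.Entry<String, String> guide : guides.entrySet()) {
                String y = guide.getKey();
                if (!y.endsWith(ending)) {
                    continue; // Y does not share this ending with X
                }
                String stemY = y.substring(0, y.length() - len);
                String lemmaY = guide.getValue();
                if (!lemmaY.startsWith(stemY)) {
                    continue; // the analogy cannot be transferred to X
                }
                // Apply to X the same ending change that turns Y into its lemma.
                String candidate = stemX + lemmaY.substring(stemY.length());
                if (lemmas.contains(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty(); // no lemma found
    }
}
```

For example, with the lemmas "mama" and "cena" and the single guide "mamy" (lemma "mama"), the form "ceny" shares the ending "y" with the guide, so the candidate "cena" is derived and confirmed against the list of lemmas. Returning an empty result when nothing matches is a choice of this sketch; the pseudo-code does not say what happens in that case.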

References

  1. Krajči S., Mati M., Novotný R.: Morphonary: A Slovak Language Dictionary, Tools for Acquisition, Organisation and Presenting of Information and Knowledge, proceedings in Informatics and Information Technologies, P. Návrat, P. Bartoš, M. Bieliková, L. Hluchý, and P. Vojtáš (eds.), Vydavateľstvo STU, 2006, ISBN 80-227-2468-8, pp. 162-165
  2. Krajči S., Novotný R.: Hľadanie základného tvaru slovenského slova na základe spoločného konca slov, 1st Workshop on Intelligent and Knowledge oriented Technologies, M. Laclavík, I. Budinská, L. Hluchý (eds.), November 2006, Bratislava, Slovakia, pp. 99–101, [ISBN 978–80–969202–5–9]
  3. Krajči S., Novotný R.: Lemmatization of Slovak words by a tool Morphonary, Tools for Acquisition, Organisation and Presenting of Information and Knowledge (2), proceedings in Informatics and Information Technologies, P. Návrat, P. Bartoš, M. Bieliková, L. Hluchý, and P. Vojtáš (eds.), Vydavateľstvo STU, 2007, ISBN 978-80-227-2716-7, pp. 115-118
  4. Laclavík M., Ciglan M., Šeleng M., Krajči S.: Ontea: Semi-automatic Pattern based Text Annotation empowered with Information Retrieval Methods, Tools for Acquisition, Organisation and Presenting of Information and Knowledge (2), proceedings in Informatics and Information Technologies, P. Návrat, P. Bartoš, M. Bieliková, L. Hluchý, and P. Vojtáš (eds.), Vydavateľstvo STU, 2007, ISBN 978-80-227-2716-7, pp. 119-129
  5. Krajči S., Novotný R., Turlíková L.: Použitie lematizácie vo fulltextovom vyhľadávaní v slovenských dokumentoch, 2nd Workshop on Intelligent and Knowledge oriented Technologies, F. Babič, J. Paralič (eds.), November 2007, Košice, Slovakia, pp. 147–152, [ISBN 978–80–89284–10–8]
  6. Laclavík M., Ciglan M., Šeleng M., Krajči S., Vojtek P., Hluchý L.: Semi-automatic semantic annotation of Slovak Texts, SLOVKO 2007, Computer treatment of Slavic and East European languages. Bratislava: Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, ISBN 978-80-87139-05-9, pp. 126-138