Method for Natural Language Identification (tool NALIT)

NALIT identifies the natural language in which a document is written.

Institution: Slovak University of Technology
Technologies used: Java
Inputs: Plain text documents, (X)HTML documents
Outputs: Documents labeled with the identified language
Documentation: HTML, doc, JavaDoc
Distribution packages: zip

Addressed Problems

The World Wide Web contains text documents written in many natural languages. Computational processing of these documents often requires filtering them by language. Even when information about a document's language is supplied (an (X)HTML page can carry a language tag, and the HTTP header can also contain this information), there is no guarantee that it is correct. We therefore developed a tool that identifies the language independently, based only on the document text.

Description

NALIT creates statistical models of languages and compares these models with an unknown document to determine which language model best fits the unknown text. We use a text classification method based on Markov processes. The method is designed to identify only the language, not the character encoding of the text.
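To make the idea of a Markov-process language model concrete, the following is a minimal sketch, not NALIT's actual implementation. It assumes a first-order model over characters (each character depends only on the previous one) with add-one smoothing; the class name MarkovLanguageModel, its methods, the smoothing constant, and the tiny training sentences are all hypothetical and chosen only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative first-order character Markov model; an assumption, not NALIT's code. */
final class MarkovLanguageModel {
    // Assumed smoothing constant: the number of distinct Java char values.
    private static final double ALPHABET_SIZE = 65536.0;

    // transitions.get(prev).get(next) = times `next` followed `prev` in training text
    private final Map<Character, Map<Character, Integer>> transitions = new HashMap<>();
    // totals.get(prev) = number of transitions observed out of `prev`
    private final Map<Character, Integer> totals = new HashMap<>();

    /** Adds the character transitions of a sample text to the model. */
    void train(String sample) {
        String s = sample.toLowerCase(Locale.ROOT);
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char next = s.charAt(i);
            transitions.computeIfAbsent(prev, k -> new HashMap<>()).merge(next, 1, Integer::sum);
            totals.merge(prev, 1, Integer::sum);
        }
    }

    /** Log-probability of the text's transitions under this model, with add-one smoothing. */
    double logLikelihood(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        double sum = 0.0;
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char next = s.charAt(i);
            int pair = transitions
                    .getOrDefault(prev, Collections.<Character, Integer>emptyMap())
                    .getOrDefault(next, 0);
            int total = totals.getOrDefault(prev, 0);
            sum += Math.log((pair + 1.0) / (total + ALPHABET_SIZE));
        }
        return sum;
    }

    public static void main(String[] args) {
        MarkovLanguageModel english = new MarkovLanguageModel();
        english.train("the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        MarkovLanguageModel slovak = new MarkovLanguageModel();
        // Diacritics omitted only to keep this source file ASCII.
        slovak.train("rychla hneda liska skace cez leniveho psa a macka sedi na rohozke");

        String unknown = "the cat sat on the mat";
        System.out.println("english: " + english.logLikelihood(unknown));
        System.out.println("slovak:  " + slovak.logLikelihood(unknown));
    }
}
```

In this sketch a higher (less negative) log-likelihood means a better fit, so the English model should score the sample sentence above the Slovak model. A realistic model would be trained on far larger corpora and might use a higher-order chain.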

Using Markov processes as a text modeling tool brings several benefits:

Process of language identification

Language identification process - selecting the best matching language model.
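The selection step named in the caption above can be sketched as a simple maximum search over per-language scores. This is only an illustration under stated assumptions: the class BestModelSelector, the method selectBest, and the helper count are hypothetical names, and the two stand-in scorers merely count one letter pair each so that the example stays short. In practice each scorer would be the fit of the text under a trained language model.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical selection step: pick the language whose model scores the text highest. */
final class BestModelSelector {

    /**
     * @param text    the document text of unknown language
     * @param scorers language name mapped to a function returning a fit score
     *                (higher is better), for example a log-likelihood
     * @return the best-scoring language name, or null if no scorers were given
     */
    static String selectBest(String text, Map<String, ToDoubleFunction<String>> scorers) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, ToDoubleFunction<String>> entry : scorers.entrySet()) {
            double score = entry.getValue().applyAsDouble(text);
            if (best == null || score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    /** Counts non-overlapping occurrences of a letter pair; a stand-in for a real model. */
    static int count(String text, String pair) {
        int hits = 0;
        int from = text.indexOf(pair);
        while (from >= 0) {
            hits++;
            from = text.indexOf(pair, from + pair.length());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in scorers for illustration only; real scorers would wrap language models.
        Map<String, ToDoubleFunction<String>> scorers = new LinkedHashMap<>();
        scorers.put("en", t -> count(t, "th"));
        scorers.put("sk", t -> count(t, "ch"));
        System.out.println(selectBest("the weather is nice", scorers)); // prints "en"
    }
}
```

Keeping the scorer as a plain function means the selection logic does not depend on how a particular language model computes its score.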

References

  1. Vojtek, P., Bieliková, M.: Comparing Natural Language Identification Methods based on Markov Processes. Slovko - International Seminar on Computer Treatment of Slavic and East European Languages, Bratislava. Accepted. (2007).
  2. Vojtek, P.: Natural Language Identification in the World Wide Web. In Bieliková, M., ed.: IIT.SRC 2006: Student Research Conference, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava (2006) 153–159.