Method for Natural Language Identification (tool NALIT)
NALIT identifies the natural language in which a document is written.
Institution: | Slovak University of Technology |
Technologies used: | Java |
Inputs: | Plain text documents, (X)HTML documents |
Outputs: | Documents associated with identified language |
Documentation: | HTML, doc, JavaDoc |
Distribution packages: | zip |
Addressed Problems
The World Wide Web contains text documents written in many natural languages. Computational processing of these documents often requires filtering them by language. Even when information about a document's language is available ((X)HTML pages can carry a language tag, and the HTTP header also contains this information), there is no guarantee that it is correct. We have therefore developed a tool that identifies the language independently, based only on the document text.
Description
NALIT creates statistical models of languages and compares these models with an unknown document to determine which language model best fits the unknown text. We use a text classification method based on Markov processes. This method is designed to identify only the language, not the character encoding of the text.
Using Markov processes as a text modeling tool brings several benefits:
- variable granularity of a text model can be achieved by changing the size of the text modeling tokens;
- when character sequences are used as the text modeling tokens, no linguistic preprocessing (e.g. function word removal, stemming, lemmatization) is needed;
- the text is modeled statistically, so any language can be learned and then identified without knowledge of its linguistic properties, such as grammatical rules.

Language identification process - selecting the best matching language model.
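
The selection of the best matching language model can be illustrated with a minimal Java sketch. It is not NALIT's actual implementation or API: the class name `MarkovLanguageIdentifier`, its methods, the fixed-order character-level Markov chain, the add-one smoothing and the assumed alphabet size of 256 are all illustrative assumptions. The `order` parameter stands in for the variable granularity of the text model mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of Markov-chain language identification (not NALIT's API). */
class MarkovLanguageIdentifier {
    // Assumed alphabet size used only for add-one smoothing.
    private static final int ALPHABET_SIZE = 256;

    // Number of preceding characters that form the Markov context.
    private final int order;

    // language -> (context -> (next character -> observed count))
    private final Map<String, Map<String, Map<Character, Integer>>> models =
            new LinkedHashMap<>();

    MarkovLanguageIdentifier(int order) {
        if (order < 1) {
            throw new IllegalArgumentException("order must be at least 1");
        }
        this.order = order;
    }

    /** Adds the character transitions found in the training text to the language model. */
    void train(String language, String text) {
        Map<String, Map<Character, Integer>> model =
                models.computeIfAbsent(language, k -> new HashMap<>());
        String s = text.toLowerCase();
        for (int i = order; i < s.length(); i++) {
            String context = s.substring(i - order, i);
            char next = s.charAt(i);
            model.computeIfAbsent(context, k -> new HashMap<>())
                 .merge(next, 1, Integer::sum);
        }
    }

    /** Returns the smoothed log-likelihood of the text under the given language model. */
    double logLikelihood(String language, String text) {
        Map<String, Map<Character, Integer>> model = models.get(language);
        if (model == null) {
            throw new IllegalArgumentException("unknown language: " + language);
        }
        String s = text.toLowerCase();
        double sum = 0.0;
        for (int i = order; i < s.length(); i++) {
            String context = s.substring(i - order, i);
            char next = s.charAt(i);
            Map<Character, Integer> counts =
                    model.getOrDefault(context, Collections.emptyMap());
            int total = 0;
            for (int c : counts.values()) {
                total += c;
            }
            int count = counts.getOrDefault(next, 0);
            // Add-one smoothing keeps unseen transitions from yielding log(0).
            sum += Math.log((count + 1.0) / (total + ALPHABET_SIZE));
        }
        return sum;
    }

    /** Returns the trained language whose model fits the text best, or null if none is trained. */
    String identify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : models.keySet()) {
            double score = logLikelihood(language, text);
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }
}
```

In this sketch, each language is first learned with `train(language, sampleText)`, and an unknown document is then passed to `identify(text)`, which returns the language with the highest log-likelihood. A text shorter than `order + 1` characters contains no transitions, so all models score equally and the first trained language is returned.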
References
- Vojtek, P., Bieliková, M.: Comparing Natural Language Identification Methods based on Markov Processes. Slovko - International Seminar on Computer Treatment of Slavic and East European Languages, Bratislava. Accepted. (2007).
- Vojtek, P.: Natural Language Identification in the World Wide Web. In Bieliková, M., ed.: IIT.SRC 2006: Student Research Conference, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava (2006) 153–159.