A              NALIT – Natural Language Identification Tool

A.1          Basic Information

Detecting the natural language of text documents is useful mainly in automated document processing. The focus is on identifying natural languages such as Slovak, English or German; artificial languages (e.g. C++, Java) are excluded.

A.1.1      Basic Terms

Markov process

A stochastic process used as a statistical text modeling method.

Markov chain

A discrete-time Markov process; in the domain of natural language processing the term is used as a synonym for Markov process.

Language model (language profile)

Statistical model of language constructed from training text.

Training text

Text document representing a natural language (e.g. a novel, a set of newspaper articles).

A.1.2      Method Description

NALIT uses a language identification method based on Markov processes, proposed by Ted Dunning [Dunning, 1994]. This approach has several benefits:

§  Variable granularity of a text model can be achieved by changing the size of the text modeling tokens;

§  Because chains of characters are used as the text modeling tokens, no linguistic preprocessing (e.g. function word removal, stemming, lemmatization) is needed;

§  The text is modeled statistically. A model of any language can be trained and then used in the identification phase; no knowledge of the linguistic properties of that language is necessary.

The language models are learned by supervised learning. A training text document is a representative portion of text written in a particular language; a language model (profile) is created from this text (Fig. 1).

Fig. 1. Creating a language model from training text.

Identification is then the process of assigning one of the learned language models to the input text document (Fig. 2).

Fig. 2. Language identification – selecting the best matching language model.

A.1.3      Scenarios of Use

NALIT can be used in the following scenarios:

§  An input text document contains only plain text.

NALIT should not be used in the following cases:

§  A document to be identified contains more than one language (e.g. half of the document is written in English and half in German);

§  The input documents are too short (only a few words);

§  Character encoding of the input text is not known.

A.1.4      External Links and Publications

§  Original language identification method based on Markov processes:
Dunning, T.: Statistical Identification of Language. Computing Research Laboratory, New Mexico State University. Technical Report (1994).

§  Vojtek, P., Bieliková, M.: Comparing Natural Language Identification Methods based on Markov Processes. Slovko 2007. Fourth International Seminar. NLP, Computational Lexicography and Terminology. 25-27 October 2007 Bratislava, Slovakia.

§  Log4J, Java-based logging utility, Apache Software Foundation. (http://logging.apache.org/log4j)

§  ISO 639-2, Codes for the representation of names of languages, alpha-3 code. (http://www.loc.gov/standards/iso639-2/langhome.html)

§  IANA Character Codes. (http://www.iana.org/assignments/character-sets)

A.2          Integration Manual

NALIT is developed in Java (Standard Edition 6) and distributed as a jar archive. Its functionality is accessed through a Java interface. NALIT is not a stand-alone application; it is intended to be included in another application or tool, which calls the NALIT interface methods.

A.2.1      Dependencies

NALIT uses the Log4J logging utility.

A.2.2      Installation

Deploying NALIT into another application requires the following steps (any Java Integrated Development Environment can be used):

1.    Two external jar archives must be added to the existing project: the jar archive containing NALIT and the Log4J jar archive.

2.    The nalit.properties file must be placed in the root directory of the project.

A.2.3      Configuration

The nalit.properties file contains the following values:

§  pathToLogFile − (relative) path to the file where Log4J stores log events. This textual log file is created automatically;

§  logFileSize − maximum log file size in kB (after this limit is reached, a new log file is created).
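A host application may want to sanity-check these two values before calling NALIT. The following is a minimal sketch using the standard java.util.Properties class; it is not part of NALIT, the class name NalitConfigCheck and the sample values are hypothetical, and only the two keys described above are assumed to exist.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class NalitConfigCheck {

    /** Loads properties from any Reader (in practice, a reader opened on nalit.properties). */
    public static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    /** Returns logFileSize in kB, failing early if the key is missing or not a number. */
    public static int logFileSizeKb(Properties props) {
        String raw = props.getProperty("logFileSize");
        if (raw == null) {
            throw new IllegalArgumentException("logFileSize is missing");
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file content; both values are examples only.
        String sample = "pathToLogFile=logs/nalit.log\nlogFileSize=512\n";
        Properties props = load(new StringReader(sample));
        System.out.println("log file: " + props.getProperty("pathToLogFile"));
        System.out.println("max size: " + logFileSizeKb(props) + " kB");
    }
}
```

Failing early on a malformed value keeps configuration mistakes separate from errors raised later by the tool itself.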

A.2.4      Integration Guide

NALIT can be used in these ways:

§  To create a language model;

§  To identify natural language of a document:

o  Batch processing − language identification of text documents processed as Java io.File, designed to process many documents at once,

o  Stream processing − language identification from Java String directly, designed to process smaller texts, tuned for fast response in real-time applications.

The following notes are useful when using NALIT:

§  A path pointing to a directory in the file system should end with a slash (“/”);

§  The tool always assigns a language to the input document. For example, when a German text is on input and only Slovak and English models are present in the model directory, the German text is labeled as Slovak or English;

§  When an exceptional or error situation occurs, NalitException is thrown. Additional information describing why the exception was thrown can be found in the log file;

§  Refer to the Additional Notes for help on selecting the proper Markov process order and an adequate size of training text.
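The trailing-slash note can be handled once in the calling code. The helper below is a small sketch, not a NALIT method; the class and method names are hypothetical.

```java
public class DirectoryPaths {

    /** Returns the path unchanged if it already ends with "/", otherwise appends one. */
    public static String withTrailingSlash(String path) {
        return path.endsWith("/") ? path : path + "/";
    }

    public static void main(String[] args) {
        System.out.println(withTrailingSlash("save/my/models/here"));
        System.out.println(withTrailingSlash("save/my/models/here/"));
    }
}
```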

Creating a Language Model

A language model is a statistical representation of a language. The model is created from a training text that represents a particular language.

The following example creates two language models, Slovak and English:

String[] trainTexts = {"books/Sládkovič_Marína.txt", "books/Kerouac_On_the_road.txt"};

String[] langCodes = {"sk", "en"};

String outputDir = "save/my/models/here/";

Nalit nalit = new NalitImplementation();

nalit.createProfiles(trainTexts, langCodes, "UTF-8", outputDir, "dunning", 2);

Each language model is stored as a separate file. The naming convention for the file is as follows: [language]-[method]-[Markov_process_order].profile.

§  [language] is a two-character language code (see ISO 639-2);

§  [method] is a short name of the method. Currently, only one identification method is provided, with the short name “dunning”;

§  [Markov_process_order] is the prefix size of the Markov process.

For example, the language models created by the example code above have the following file names: sk-dunning-2.profile and en-dunning-2.profile.
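The naming convention can be built and checked programmatically, for instance when verifying that a model directory contains the expected profiles. The sketch below is an illustration only; ProfileFileName and its methods are hypothetical helpers, not part of the NALIT interface.

```java
public class ProfileFileName {

    private static final String SUFFIX = ".profile";

    /** Builds a name such as "sk-dunning-2.profile". */
    public static String build(String language, String method, int order) {
        return language + "-" + method + "-" + order + SUFFIX;
    }

    /** Splits a profile file name into {language, method, order}. */
    public static String[] parse(String fileName) {
        if (!fileName.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("Not a profile file: " + fileName);
        }
        String stem = fileName.substring(0, fileName.length() - SUFFIX.length());
        String[] parts = stem.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Unexpected profile name: " + fileName);
        }
        Integer.parseInt(parts[2]); // throws NumberFormatException if the order is not a number
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(build("sk", "dunning", 2));
        String[] parts = parse("en-dunning-2.profile");
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```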

Language Identification Batch Processing

Batch processing mode identifies the language of all files in a selected directory. Example of use:

String pathToModels = "path_to_lang_models";

String inputDir = "path_to_text_files";

String encoding = "UTF-8";

Nalit nalit = new NalitImplementation();

 

LinkedList results =

nalit.identify(pathToModels, inputDir, encoding);

Here, pathToModels is the path to the directory with the relevant language models. The directory must contain only the language models. For example, when a user needs to identify the Slovak and English languages only, the pathToModels directory contains just the sk-dunning-2.profile and en-dunning-2.profile files. inputDir is the path to the directory where the input text documents to be identified are stored. All these documents must have the same character encoding, given by the encoding parameter (see the list of valid character encoding names according to IANA).

After the identification process is completed, the results returned in the LinkedList have the following structure:

[path to text file no.1]

[language code assigned to the text file no.1]

[path to text file no.2]

[language code assigned to the text file no.2]

...
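Because paths and language codes alternate in one flat list, callers may find it convenient to pair them up. The sketch below assumes that each entry can be read as a String (the element type of the returned list is not stated here); ResultPairs is a hypothetical helper, not part of NALIT.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResultPairs {

    /** Converts [path1, code1, path2, code2, ...] into an ordered path -> code map. */
    public static Map<String, String> toMap(List<?> flatResults) {
        if (flatResults.size() % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of entries");
        }
        Map<String, String> byPath = new LinkedHashMap<String, String>();
        // An iterator avoids the slow index-based access of a LinkedList.
        Iterator<?> it = flatResults.iterator();
        while (it.hasNext()) {
            String path = String.valueOf(it.next());
            String code = String.valueOf(it.next());
            byPath.put(path, code);
        }
        return byPath;
    }

    public static void main(String[] args) {
        // Hypothetical result list in the layout described above.
        List<String> results = Arrays.asList("docs/a.txt", "sk", "docs/b.txt", "en");
        System.out.println(toMap(results));
    }
}
```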

Language Identification Stream Processing

Stream processing mode is useful when the text to be identified arrives irregularly, contains only a few sentences or words, and a quick (real-time) response is required. An example of using NALIT in stream processing follows (note that the same method is called for identification as in batch processing):

String pathToModels = "path_to_lang_models";

String encoding = "String";

String inputText1 = "Aký je môj jazyk?";

String inputText2 = "What is my language?";

Nalit nalit = new NalitImplementation();

 

LinkedList result1 =

nalit.identify(pathToModels, inputText1, encoding);

 

LinkedList result2 =

nalit.identify(pathToModels, inputText2, encoding);

Note that the language models are loaded only once (at the first call of the identify method) and then cached to improve performance. The encoding parameter must be set to “String”, which indicates that stream processing is used.

The results of the language identification are stored in a LinkedList containing only two entries. The first value is the input text and the second value is the language code assigned to the text.

Additional Notes

Determining the proper order of the Markov process is a fundamental task in language model creation. Generally, the best results should be achieved with Markov processes of 1st, 2nd or 3rd order. Another aspect affecting the identification success rate is the size of the training text. The following rule applies: the larger the input text documents are, the smaller the training text can be, and vice versa.

For example, when documents of 1 kB are expected on input, the training text for each language model should be 100 kB. In contrast, when documents of 100 bytes are expected on input, the training text for each language model should be 500 kB. Note that these numbers also depend on which languages are considered. When very similar languages are to be identified, the models should be created from a larger amount of training text.

A.3          Development Manual

A.3.1      Tool Structure

NALIT consists of the following packages (the structure and dependencies of the packages are displayed in Fig. 3):

§  Support for character encoding conversion (sk.fiit.nazou.nalit.convert_encoding);

§  Core of the identification method (sk.fiit.nazou.nalit.dunning);

§  Initialization of logging using Log4j library (sk.fiit.nazou.nalit.log4j_support);

§  Exceptions thrown by the tool (sk.fiit.nazou.nalit.nalit_exceptions);

§  Interface of NALIT and encapsulation of the sk.fiit.nazou.nalit.dunning package (sk.fiit.nazou.nalit.nalit_inteface);

§  Support methods for text preprocessing (sk.fiit.nazou.nalit.nalit_support).

Fig. 3. NALIT packages – structure and dependencies.

A.3.2      Method Implementation

The functionality of the language identification tool consists of two main processes: creating the language models and language identification.

Language model creation

Fig. 4 depicts the process of creating a new language model from an input training text document. An instance of the class CreateHash reads the training text document (represented as File) using the inputFromFile method. Each time the inputFromFile method is called, the sliding window moves by one character. The size of the sliding window is determined by the order of the Markov process used for modeling (order of Markov process = number of characters in the sliding window).

Each time a new Markov process is read, the occurrence count of this Markov process is updated in the MultiHashtable using the insertChain call (note that the term Markov chain is used instead of Markov process in the domain of language processing). Markov processes and their occurrence counts are stored in the MultiHashtable. When the whole training text has been processed, the MultiHashtable is converted into a ProbabilityHashtable. The conversion consists mainly of computing the transition probability of each Markov process from its number of occurrences. Finally, the ProbabilityHashtable is stored persistently in the file system.
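The counting and conversion steps can be illustrated with plain Java maps. This is a simplified sketch, not the CreateHash, MultiHashtable or ProbabilityHashtable code: it assumes that the order is the prefix length, so each chain consists of the prefix plus the character that follows it, and that the transition probability is the chain count divided by the total count of its prefix. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkovModelSketch {

    /** Counts every (order + 1)-character chain, moving the window one character at a time. */
    public static Map<String, Integer> countChains(String text, int order) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + order < text.length(); i++) {
            String chain = text.substring(i, i + order + 1);
            Integer old = counts.get(chain);
            counts.put(chain, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Converts chain counts into P(last character | prefix). */
    public static Map<String, Double> toTransitionProbabilities(Map<String, Integer> counts) {
        // Total occurrences of each prefix over all following characters.
        Map<String, Integer> prefixTotals = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String prefix = e.getKey().substring(0, e.getKey().length() - 1);
            Integer old = prefixTotals.get(prefix);
            prefixTotals.put(prefix, old == null ? e.getValue() : old + e.getValue());
        }
        Map<String, Double> probabilities = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String prefix = e.getKey().substring(0, e.getKey().length() - 1);
            probabilities.put(e.getKey(), e.getValue() / (double) prefixTotals.get(prefix));
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Toy training text; a real model needs far more data.
        Map<String, Integer> counts = countChains("abac", 1);
        System.out.println(toTransitionProbabilities(counts));
    }
}
```

For the toy text "abac" with order 1, the prefix "a" is followed once by "b" and once by "c", so each of these transitions gets probability 0.5, while "b" is always followed by "a".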

Fig. 4. Sequence diagram − creating a language model from training document.

Language identification

Fig. 5 displays the process of identifying the language of a document. A text document with a yet unassigned language (represented as File) is loaded using the readText method and stored internally as a String. Markov processes are then extracted from the text using the getMarkovChains method. If the loaded language model (ProbabilityHashtable) already contains an extracted Markov process, the probability assigned to that Markov process is returned (actProbability method). All probabilities are summed and normalized; the final computed value represents how closely the input text document matches the particular language model.

This process is repeated for each language model, and the most similar language model (i.e. language) is then assigned to the input text document.
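The scoring loop can be sketched as follows. This is an illustration under stated assumptions, not the NALIT implementation: each model is a plain map from chain to probability, a chain is the order-length prefix plus the next character, and normalization is assumed to mean dividing the sum by the number of chains in the text (the exact normalization used by the tool is not given here). All names and the toy probabilities are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class IdentifySketch {

    /** Average probability of the text's chains under one model; unseen chains add nothing. */
    public static double score(String text, Map<String, Double> model, int order) {
        int chains = 0;
        double sum = 0.0;
        for (int i = 0; i + order < text.length(); i++) {
            Double p = model.get(text.substring(i, i + order + 1));
            if (p != null) {
                sum += p;
            }
            chains++;
        }
        return chains == 0 ? 0.0 : sum / chains;
    }

    /** Returns the code of the best-scoring model; some model always wins, even on a poor match. */
    public static String identify(String text, Map<String, Map<String, Double>> models, int order) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : models.entrySet()) {
            double s = score(text, e.getValue(), order);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy models with made-up probabilities, for illustration only.
        Map<String, Double> en = new HashMap<String, Double>();
        en.put("th", 0.6);
        en.put("he", 0.5);
        Map<String, Double> sk = new HashMap<String, Double>();
        sk.put("ch", 0.6);
        sk.put("ov", 0.5);
        Map<String, Map<String, Double>> models = new LinkedHashMap<String, Map<String, Double>>();
        models.put("sk", sk);
        models.put("en", en);
        System.out.println(identify("the", models, 1));
    }
}
```

Because the best score is always taken, a text that matches none of the models is still labeled with one of them, which is why only the relevant models should be present.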

Fig. 5. Sequence diagram − language identification.

A.3.3      Enhancements and Optimizing

The input document is loaded and stored as a String in the language identification process, which consumes a lot of memory when large documents are on input. Processing the input in a loop with a sliding window can optimize this. Note that the process of creating a model is already optimized, because the training documents are usually much larger than the documents in the identification phase.
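One possible shape of this optimization is to score the document directly from a Reader, keeping only the current window in memory. The sketch below is a suggestion, not existing NALIT code; it assumes a model represented as a map from chain to probability, a chain made of the order-length prefix plus the next character, and normalization by the number of chains. The class name StreamingScore is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class StreamingScore {

    /** Scores a character stream against one model while holding only one window in memory. */
    public static double score(Reader reader, Map<String, Double> model, int order) throws IOException {
        StringBuilder window = new StringBuilder(order + 1);
        double sum = 0.0;
        long chains = 0;
        int c;
        while ((c = reader.read()) != -1) {
            window.append((char) c);
            if (window.length() > order + 1) {
                window.deleteCharAt(0); // slide the window forward by one character
            }
            if (window.length() == order + 1) {
                Double p = model.get(window.toString());
                if (p != null) {
                    sum += p; // chains missing from the model contribute nothing
                }
                chains++;
            }
        }
        return chains == 0 ? 0.0 : sum / chains;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Double> model = new HashMap<String, Double>();
        model.put("ab", 0.9);
        model.put("ba", 0.8);
        // A StringReader stands in for a buffered file reader opened with the document's encoding.
        System.out.println(score(new StringReader("abab"), model, 1));
    }
}
```

With this approach memory use depends on the Markov order rather than on the document size.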

A.4          Manual for Adaptation to Other Domains

The language identification method used in NALIT is a universal classification method that can be applied to many categorization tasks. NALIT has been used in the following document categorization tasks:

§  Document author gender determination – answering the question “Is a particular document written by a man or by a woman?”;

§  Document authorship identification – categorizing documents according to their authors;

§  Job-offer presence in documents – identifying whether a document contains a job offer or not.

No adaptations of the tool are needed for such tasks.

More generally, the classification method is not restricted to text documents; it can be used to categorize any sequences of objects. For non-textual classification, changes to the internal structure of NALIT are necessary, because the data structures used by NALIT are designed to store textual information only.

A.4.1      Configuring to Other Domain

When using NALIT in other text document categorization tasks, attention should be paid to the process of creating category models. In language identification, a category is represented by a particular natural language. In document authorship classification, for example, a category must be created for each author; usually a longer training text is needed to create a category than in language identification.

A.4.2      Dependencies

NALIT depends on Log4J regardless of the domain in which it is used.