A              ERID – Estimation of Relevance for Internet Documents

A.1          Basic Information

Internet documents vary in topic and in the structure of their content. ERID uses a single neuron (perceptron) to estimate how relevant an Internet document is, taking matches of predefined keywords (expressed as regular expressions) as the features of the neuron input. The perceptron is trained on a set of web pages that describe the required domain well, so that it learns the desired behavior. The WebCrawler tool uses the resulting relevance estimate to decide whether to store a copy of a crawled Internet document.

A.1.1      Basic Terms

HTML – HyperText Markup Language

TLU – Threshold Logic Unit

A.1.2      Method Description

The ERID tool uses a single neuron (perceptron), a basic method well known in the neural network community. The most important property of a neuron is that it can be trained on a set of inputs. In general, the inputs are features extracted from the input data, and they can vary in many ways. The main requirement is that each feature be measurable, that is, transformable into a numerical value; these values together form the input vector. How the features are measured and evaluated is up to the programmer.

The perceptron is a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier.

In the perceptron, xj denotes the j-th item of the input vector, wj denotes the j-th item of the weight vector (the importance factor of the corresponding feature), a denotes the weighted sum, that is, the dot product of the input vector and the weight vector, b is the threshold, and y is the output of the perceptron.

Figure 1: Schema of perceptron.
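
The relation between the symbols above can be illustrated with a minimal Java sketch. It is not the ERID source code: the class name TluSketch, the 0/1 output encoding and the choice of "greater than or equal" for the threshold comparison are assumptions made for illustration only.

```java
/** Illustrative threshold logic unit: y = 1 if a >= b, otherwise 0 (assumed convention). */
final class TluSketch {
    private final double[] weights;  // w_j: importance factor of the j-th feature
    private final double threshold;  // b

    TluSketch(double[] weights, double threshold) {
        this.weights = weights.clone();
        this.threshold = threshold;
    }

    /** Weighted sum a: the dot product of the input vector and the weight vector. */
    double activation(double[] input) {
        if (input.length != weights.length) {
            throw new IllegalArgumentException("input and weight vectors differ in length");
        }
        double a = 0.0;
        for (int j = 0; j < weights.length; j++) {
            a += weights[j] * input[j];
        }
        return a;
    }

    /** Output y of the perceptron for the given input vector. */
    int output(double[] input) {
        return activation(input) >= threshold ? 1 : 0;
    }
}
```

With weights (0.5, 1.0) and threshold 1.0, the input (1, 0) gives a = 0.5 and therefore y = 0, while the input (1, 1) gives a = 1.5 and therefore y = 1.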

A.1.3      Scenarios of Use

The ERID tool can be used for any categorization task in which features of the categorized object can be identified and used as an input vector.

A.1.4      External Links and Publications

A.2          Integration Manual

The ERID tool is developed in Java SE 1.5 and distributed as a JAR package. Its objects are accessible through Java interfaces. The tool is customized by setting up a keyword matcher implemented with regular expressions.

A.2.1      Dependencies

ERID uses the following external libraries:

-        Corporate memory libraries: libraries developed within the scope of the NAZOU project.

-        HtmlParser library (http://sourceforge.net/projects/htmlparser, version 2.0)

-        Log4j library (http://logging.apache.org/log4j/, version 1.2.8+)

-        JUnit library (http://sourceforge.net/projects/junit/, version 4.3+)

A.2.2      Installation

Installation requires the Apache Ant tool and Java SE 1.5. The installation procedure consists of the following steps:

1.      Download the ERID source code.

2.      Run the shell command ant jar to build the JAR file.

A.2.3      Configuration

1.      To set up the ERID keywords and weights, edit the file build.properties and set the following parameters:

-        page.dir: the path to the directory where the categorized HTML files are stored. ERID expects files whose names are prefixed with “yes” to belong to the category and files prefixed with “no” not to belong to it. This naming is important for the training and evaluation of ERID,

-        stopword.properties: the path to the file containing the list of words that are excluded from the evaluation,

-        keyword.properties: the path to the file that contains the weight setup. The file holds pairs of a keyword (which can be written as a regular expression) and its weight; the pairs are loaded during ERID initialization. The initial weights can be set to random values. This property should not be set when running the find-candidates task.

2.      Run the shell command ant find-candidates to list the most frequently used keywords in the training set.

3.      Run the shell command ant train to train the ERID tool on the given training set. By default, the iteration limit is 1000 loops and the learning rate is 0.005. Both values can be changed in the Evaluate Java class.

4.      Run the shell command ant evaluate to check the training progress. The training process (step 3) can be repeated until an acceptable result is returned.
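
The loading of keyword and weight pairs mentioned for keyword.properties could look roughly like the following sketch. The actual file layout is not specified in this manual; the sketch assumes the standard Java properties format with the regular expression as the key and the weight as the value. The class name KeywordWeightsSketch and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical loader; assumes "regex = weight" lines in Java properties format. */
final class KeywordWeightsSketch {

    /** Reads all pairs from the source, sorted by keyword for a stable order. */
    static Map<String, Double> load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        Map<String, Double> weights = new TreeMap<>();
        for (String keyword : props.stringPropertyNames()) {
            Pattern.compile(keyword); // fail early on an invalid regular expression
            weights.put(keyword, Double.valueOf(props.getProperty(keyword).trim()));
        }
        return weights;
    }

    /** Convenience variant for in-memory text, for example in a unit test. */
    static Map<String, Double> parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("unexpected I/O error on in-memory text", e);
        }
    }
}
```

In practice the Reader would be opened on the file named by the keyword.properties parameter. Note that in the properties format a backslash is an escape character, so a regular expression such as \d+ would have to be written as \\d+ in such a file.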

A.2.4      Integration Guide

The tool can be used through the Java interfaces of the ERID and TLU implementations:

Figure 2: Tlu and Erid interface.

The ERID tool uses the TLU interface to access the methods for perceptron initialization, training, and evaluation of an input set. The ERID interface is used for setting keywords (the features of a document), training the TLU on the content of pre-categorized web pages, and evaluating the category of a given web page from its content. See the JavaDoc for more information about the methods and their parameters.

Creating training set

  1. Establish unambiguous categorization rules. For example, to fall into the domain of job offers, a web page must contain at least the work position name, the work position location, the offered salary, and the requirements; moreover, the page must not contain multiple job offers, the work position must have a contact person, etc.
  2. Create copies of the contents of human-categorized web pages. For job offers, the user (the person who trains the ERID tool) should visit various job offer portals and download pages at random. The user must mark each web page that contains a job offer (a page fulfilling the rules) with the prefix “yes” and every other web page with the prefix “no”. These web page copies should reside in a single directory.
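
The prefix convention from the steps above can be sketched in Java as follows. This is an illustration only: the class name TrainingSetSketch, the numeric label encoding (1, 0, -1) and the case-sensitive prefix match are assumptions, not documented ERID behavior.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that derives training labels from file name prefixes. */
final class TrainingSetSketch {

    /** Returns 1 for a "yes" prefix, 0 for a "no" prefix, and -1 for any other name. */
    static int labelFor(String fileName) {
        if (fileName.startsWith("yes")) {
            return 1;
        }
        if (fileName.startsWith("no")) {
            return 0;
        }
        return -1; // not part of the training set
    }

    /** Maps every labeled regular file in the directory to its expected result. */
    static Map<File, Integer> labelDirectory(File pageDir) {
        Map<File, Integer> labeled = new LinkedHashMap<>();
        File[] files = pageDir.listFiles();
        if (files == null) {
            return labeled; // path is not a directory or could not be read
        }
        for (File file : files) {
            int label = labelFor(file.getName());
            if (file.isFile() && label >= 0) {
                labeled.put(file, label);
            }
        }
        return labeled;
    }
}
```

Because only the prefix is checked, any file whose name merely starts with the letters "no" would count as a negative example, so the training directory should contain nothing but the categorized page copies.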

Selecting proper keywords (features)

The user can intuitively choose keywords that match the required domain. The selected keywords, together with the categorized training set, define the domain used by ERID. The user can run the find-candidates task (see Step 2 in section A.2.3 Configuration) to review the most frequently used keywords in the page contents of the training set and choose only those keywords that are relevant to the domain.
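
The idea behind reviewing the most frequently used words can be sketched as a simple frequency count that skips stop words. The sketch below is not the implementation of the find-candidates task or of the WordStat class; the class name CandidateKeywordsSketch, the tokenization on non-letter characters and the lower-casing are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical word-frequency helper for choosing candidate keywords. */
final class CandidateKeywordsSketch {

    /** Returns up to 'limit' words, most frequent first; ties are ordered alphabetically. */
    static List<String> topWords(String text, Set<String> stopwords, int limit) {
        final Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue; // skip split artifacts and excluded words
            }
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        }
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(limit, words.size())));
    }
}
```

For the text "Salary and position: salary, the position, salary" with the stop words "and" and "the", the two most frequent words are "salary" (three occurrences) and "position" (two occurrences).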

Training TLU

The TLU is trained by running the train task (see Step 3 in section A.2.3 Configuration). The ERID tool reads the content of the web pages located in the directory specified by the page.dir property and creates a list of input vectors and a vector of expected results, which are passed to the method Tlu.train. This method adjusts the weight of each keyword and stores the weights in the file specified by keyword.properties.
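
This manual does not state which update rule Tlu.train uses. The following sketch assumes the classic perceptron learning rule, in which each misclassified example shifts the weights by the learning rate times the error times the input, and shifts the threshold in the opposite direction. The class name TluTrainingSketch and the method signature are hypothetical; the values 0.005 and 1000 in the usage note are the defaults mentioned in section A.2.3.

```java
/** Hypothetical trainer; the update rule is an assumption, not taken from the ERID source. */
final class TluTrainingSketch {
    final double[] weights; // one weight per keyword, starting at zero
    double threshold;       // starts at zero

    TluTrainingSketch(int featureCount) {
        this.weights = new double[featureCount];
    }

    int output(double[] x) {
        double a = 0.0;
        for (int j = 0; j < weights.length; j++) {
            a += weights[j] * x[j];
        }
        return a >= threshold ? 1 : 0;
    }

    /** Returns true if some pass over the training set produced no errors. */
    boolean train(double[][] inputs, int[] expected, double learningRate, int iterationLimit) {
        for (int pass = 0; pass < iterationLimit; pass++) {
            int errors = 0;
            for (int i = 0; i < inputs.length; i++) {
                int delta = expected[i] - output(inputs[i]); // -1, 0 or +1
                if (delta == 0) {
                    continue;
                }
                errors++;
                for (int j = 0; j < weights.length; j++) {
                    weights[j] += learningRate * delta * inputs[i][j];
                }
                threshold -= learningRate * delta;
            }
            if (errors == 0) {
                return true; // every example is classified as expected
            }
        }
        return false; // iteration limit reached without a clean pass
    }
}
```

A call such as train(inputs, expected, 0.005, 1000) would mirror the default settings. If the training set is not linearly separable, this rule never reaches an error-free pass, which is one reason an iteration limit is needed.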

Development Manual

A.2.5      Tool Structure

The ERID tool is composed of the following classes:

TluImpl class – provides a simple perceptron implementation. It exposes methods for setting and getting weights, training the weights, and evaluating an input set.

EridImpl class – provides an implementation of keyword-based features for HTML pages.

WordStat class – helps during keyword selection.

Configuration class and Evaluation class – provide methods for computing the initial settings and for evaluating the trained TLU.

Figure 3: ERID tool block schema.

A.2.6      Method Implementation

The ERID tool uses the single-neuron method (described in section A.1.2 Method Description) to evaluate an input set (web page contents). The neuron method is implemented in the class Tlu. The Erid class counts word occurrences and uses a single, properly initialized Tlu object to evaluate the domain of the web page content.

Figure 4: Dependency diagram for Erid interface.

Figure 5: Dependency diagram for Tlu interface.
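
How keyword occurrences might be turned into an input vector is sketched below. The sketch assumes that the page text has already been extracted from the HTML and that raw match counts are used as feature values; whether ERID normalizes the counts is not stated in this manual. The class name KeywordFeaturesSketch is hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical feature extractor: one occurrence count per keyword pattern. */
final class KeywordFeaturesSketch {

    /** Counts non-overlapping matches of each keyword pattern in the page text. */
    static double[] toInputVector(String pageText, List<Pattern> keywords) {
        double[] x = new double[keywords.size()];
        for (int j = 0; j < x.length; j++) {
            Matcher matcher = keywords.get(j).matcher(pageText);
            while (matcher.find()) {
                x[j]++;
            }
        }
        return x;
    }
}
```

The resulting array has one entry per keyword, in the order of the keyword list, and can be passed to a perceptron whose weight vector uses the same order.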

A.2.7      Enhancements and Optimizing

The ERID tool can be enhanced by implementing multi-level TLUs for more precise domain categorization. TLUs can be combined using logical operators or by using the output of one TLU as an input for another TLU.
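
Both kinds of combination can be illustrated with one small sketch, since a logical AND or OR of binary outputs is itself a TLU (weights of 1 and a threshold of 2 or 1, respectively). The class name MultiLevelTluSketch, the method names and the "greater than or equal" threshold convention are assumptions; this enhancement is not part of the current tool.

```java
/** Hypothetical two-level arrangement of threshold logic units. */
final class MultiLevelTluSketch {

    /** Single unit: 1 if the weighted sum reaches the threshold b, otherwise 0. */
    static int tlu(double[] w, double b, double[] x) {
        double a = 0.0;
        for (int j = 0; j < w.length; j++) {
            a += w[j] * x[j];
        }
        return a >= b ? 1 : 0;
    }

    /** Feeds the outputs of the first-level units into a second-level unit. */
    static int cascade(double[][] firstWeights, double[] firstThresholds,
                       double[] secondWeights, double secondThreshold, double[] x) {
        double[] hidden = new double[firstWeights.length];
        for (int i = 0; i < hidden.length; i++) {
            hidden[i] = tlu(firstWeights[i], firstThresholds[i], x);
        }
        return tlu(secondWeights, secondThreshold, hidden);
    }
}
```

With two first-level units and second-level weights (1, 1), a second-level threshold of 2 requires both units to fire (logical AND), while a threshold of 1 requires at least one of them (logical OR).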

A.3          Manual for Adaptation to Other Domains

A.3.1      Configuring to Other Domain

The ERID tool can be adapted to other domains by customizing the training set (described in paragraph “Creating training set” in section A.2.4 Integration Guide) and selecting proper keywords (described in paragraph “Selecting proper keywords” in section A.2.4 Integration Guide).