ERID - Estimate Relevance of Internet Documents

The tool estimates the relevance of an Internet document.

Institution: Institute of Informatics
Technologies used: Java, MySQL
Inputs: Content of an Internet document.
Outputs: Estimated relevance rate.
Documentation: HTML, doc, JavaDoc
Distribution packages: zip

Addressed Problems

Internet documents vary in topic and in the structure of their content. ERID uses a single neuron (perceptron) to estimate the relevance of an Internet document, taking predefined keywords (expressed as regular expressions) as the features of the neuron's input. The perceptron is trained on a set of web pages that describe the required domain well, so that it acquires the desired behavior. The WebCrawler tool uses the resulting relevance rate to decide whether to store a copy of a crawled Internet document.
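
As an illustration of how keywords expressed as regular expressions could become neuron inputs, the following minimal sketch counts pattern matches in a document. It is not taken from the ERID sources: the class name KeywordFeatures, the method extract, the sample patterns, and the normalisation by token count are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the ERID distribution.
class KeywordFeatures {

    // Counts the non-overlapping matches of each keyword pattern and divides
    // by the number of whitespace-separated tokens. This definition of a
    // "frequency rate" is an assumption of the sketch.
    static double[] extract(String text, List<Pattern> keywords) {
        String trimmed = text.trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        double[] features = new double[keywords.size()];
        for (int i = 0; i < keywords.size(); i++) {
            Matcher matcher = keywords.get(i).matcher(text);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            features[i] = tokens == 0 ? 0.0 : (double) count / tokens;
        }
        return features;
    }

    public static void main(String[] args) {
        // Example patterns chosen only for illustration.
        List<Pattern> keywords = Arrays.asList(
            Pattern.compile("\\bgrid\\b", Pattern.CASE_INSENSITIVE),
            Pattern.compile("\\bworkflows?\\b", Pattern.CASE_INSENSITIVE));
        double[] x = extract("Grid workflows run on the grid", keywords);
        System.out.println(Arrays.toString(x));
    }
}
```

Normalising by document length keeps long pages from dominating short ones; raw counts would work as well if the training set uses the same convention.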

Description

The ERID tool implements a single neuron (a perceptron, or threshold logic unit) that processes a pool of keyword patterns to estimate the relevance of page content. Before estimation can take place, the perceptron must be trained to recognize the target domain. Training covers the preparation of a training set and the selection of the most relevant keywords for the target domain, which involves human intervention. In general, the training set contains two groups of web document samples: the documents most pertinent to the target domain, and documents that are completely unrelated to it. After successful training, the perceptron is ready to estimate the relevance of an Internet document to the target domain. Estimation is based on the page's keyword frequency rates and the trained keyword weights; these serve as inputs to the perceptron, which can then evaluate the web page's binary membership in the target domain.
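
The training step described above can be sketched with the classic perceptron learning rule. The text does not state which update rule ERID uses, so the rule, the class name PerceptronTrainer, the learning rate, and the epoch limit are assumptions of this example, not a description of the actual implementation.

```java
// Hypothetical trainer; ERID's real training procedure is not specified here.
class PerceptronTrainer {

    // samples[s] holds the keyword frequencies of document s;
    // relevant[s] is true for documents from the pertinent group.
    // Returns the weights w[0..n-1] followed by a bias term at index n.
    static double[] train(double[][] samples, boolean[] relevant,
                          double learningRate, int maxEpochs) {
        int n = samples[0].length;
        double[] w = new double[n + 1]; // starts at zero (assumed initial weights)
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int errors = 0;
            for (int s = 0; s < samples.length; s++) {
                int predicted = predict(w, samples[s]) ? 1 : 0;
                int target = relevant[s] ? 1 : 0;
                int delta = target - predicted;
                if (delta != 0) {
                    errors++;
                    // Move the weights towards the misclassified sample's label.
                    for (int i = 0; i < n; i++) {
                        w[i] += learningRate * delta * samples[s][i];
                    }
                    w[n] += learningRate * delta;
                }
            }
            if (errors == 0) {
                break; // every training sample is classified correctly
            }
        }
        return w;
    }

    // Binary membership: true when the weighted sum plus bias is positive.
    static boolean predict(double[] w, double[] x) {
        double a = w[x.length]; // bias term
        for (int i = 0; i < x.length; i++) {
            a += w[i] * x[i];
        }
        return a > 0.0;
    }
}
```

The loop stops early only when the two groups are linearly separable in the chosen keyword features; otherwise it runs until maxEpochs, which is one reason the keyword selection matters.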

Schema of TLU (Threshold Logic Unit)

The schema of the TLU (Threshold Logic Unit) shows the implementation of a single perceptron. Xi represents the input keyword frequencies, Wi the initial weights for the keywords, a the weighted sum of Xi and Wi, y the threshold rate, and b the binary estimation value.
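
The symbols in the schema map onto code roughly as follows. This is a minimal sketch and not the ERID source: the class name ThresholdLogicUnit and its methods are hypothetical, and the use of "greater than or equal" for the threshold comparison is an assumption, since the schema description does not say how the boundary case is handled.

```java
// Hypothetical illustration of the TLU schema; names are not from ERID.
class ThresholdLogicUnit {
    private final double[] weights;  // Wi: one weight per keyword
    private final double threshold;  // y: threshold rate

    ThresholdLogicUnit(double[] weights, double threshold) {
        this.weights = weights.clone();
        this.threshold = threshold;
    }

    // a: weighted sum of the inputs Xi and the weights Wi.
    double activation(double[] x) {
        if (x.length != weights.length) {
            throw new IllegalArgumentException("expected " + weights.length + " inputs");
        }
        double a = 0.0;
        for (int i = 0; i < x.length; i++) {
            a += x[i] * weights[i];
        }
        return a;
    }

    // b: binary estimation value, 1 when a reaches the threshold (assumed >=).
    int estimate(double[] x) {
        return activation(x) >= threshold ? 1 : 0;
    }
}
```

For example, with weights {2.0, -1.0} and threshold 0.5, the input {0.5, 0.25} gives a = 0.75 and therefore b = 1.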

Schema of ERID tool

References

  1. Gatial E., Balogh Z.: Identifying, Retrieving and Determining Relevance of Heterogenous Internet Resources. In: Tools for Acquisition, Organisation and Presenting of Information and Knowledge. P. Navrat et al. (Eds.), Vydavatelstvo STU, Bratislava, 2006, pp. 15-21, ISBN 80-227-2468-8. Workshop 26-28 September, Nizke Tatry, Slovakia.
  2. Gatial E., Laclavik M., Balogh Z., Hluchy L.: Web Page Categorization in Process of Focused Crawling. In: Proceedings of the Ninth International Conference on Informatics 2007, Ivan Plander (Ed.), Slovak Society for Applied Cybernetics and Informatics, Bratislava, Slovakia, 2007, pp. 95-99, ISBN 978-80-969243-7-0.