WebCrawler

The tool traverses web documents on the Internet and creates local copies of them.

Institution: Institute of Informatics
Technologies used: Java, MySQL
Inputs: URL of the Internet site to be crawled
Outputs: Copies of the downloaded documents and metadata, stored in MySQL
Documentation: HTML, doc
Distribution packages: zip

Addressed Problems

The Internet contains a vast amount of information, spread across the whole network and stored mainly in HTML documents. Many systems that process Internet data, in particular search engines, employ tools known as web crawlers or web spiders to acquire and update specific heterogeneous information. Such tools implement methods known as web crawling or spidering for automated Internet data acquisition. The methods implemented in the WebCrawler tool enhance the architecture of contemporary crawlers with a method for handling dynamic links that addresses the challenge of avoiding spider traps (which typically occur when processing dynamic links). The proposed method also employs the ERID tool to provide focused crawling.

Description

The basic principle of the web crawling method relies on hypertext links, which form the huge network known as the Internet. A hypertext link is represented as a URL (Uniform Resource Locator), which contains the information needed to uniquely locate the referenced web resource.
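For illustration only (the example address is hypothetical), a short Java snippet shows how such a URL can be decomposed into the components a crawler works with:

    import java.net.URI;

    public class UrlParts {
        public static void main(String[] args) {
            // Hypothetical example of a dynamic URL with two parameters.
            URI uri = URI.create("http://www.example.org/catalog/list.php?category=5&sort=date");
            System.out.println("scheme: " + uri.getScheme()); // http
            System.out.println("host:   " + uri.getHost());   // www.example.org
            System.out.println("path:   " + uri.getPath());   // /catalog/list.php
            System.out.println("query:  " + uri.getQuery());  // category=5&sort=date
        }
    }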

A common web crawler (see Figure 1) implements a method composed of the following steps: a URL is taken from the queue of pending URLs, the referenced document is downloaded, the hypertext links contained in it are extracted, and the newly discovered URLs are added to the queue; the cycle repeats until a stopping condition is reached.

Figure 1. Standard web crawler architecture (block schema).
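This fetch-parse-enqueue cycle can be sketched in Java as follows. The sketch is a minimal illustration, not the tool's actual code: the seed URL, the page limit, the regular-expression link extraction and the storage stub are assumptions made for the example.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be visited
            Set<String> visited = new HashSet<>();         // URLs already processed
            frontier.add("http://www.example.org/");       // seed URL (the crawler's input)

            while (!frontier.isEmpty() && visited.size() < 100) {   // stopping condition
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue;                              // page already downloaded
                }
                String html = download(url);               // fetch the referenced document
                store(url, html);                          // keep a local copy (the tool uses MySQL)

                Matcher m = HREF.matcher(html);            // extract hypertext links
                while (m.find()) {
                    frontier.add(m.group(1));              // enqueue newly discovered URLs
                }
            }
        }

        private static String download(String url) throws Exception {
            try (java.util.Scanner s = new java.util.Scanner(new java.net.URL(url).openStream(), "UTF-8")) {
                return s.useDelimiter("\\A").hasNext() ? s.next() : "";
            }
        }

        private static void store(String url, String html) {
            // Placeholder: the real tool stores documents and metadata in MySQL.
            System.out.println("stored " + url + " (" + html.length() + " characters)");
        }
    }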

The web crawler method implements several basic policies, such as a selection policy (which pages to download), a re-visit policy (when to check downloaded pages for changes), a politeness policy (how to avoid overloading the crawled sites), and a parallelization policy (how to coordinate parallel downloads).
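The source does not show how these policies are realized in the tool; purely as an illustration, a minimal Java sketch of a politeness policy that enforces a per-host delay between requests (the class and its delay value are assumptions):

    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    /** Politeness policy sketch: enforce a minimum delay between requests to the same host. */
    public class PolitenessPolicy {
        private final long delayMillis;
        private final Map<String, Long> lastRequest = new HashMap<>();

        public PolitenessPolicy(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        /** Blocks until the configured delay since the last request to the URL's host has passed. */
        public synchronized void waitForHost(String url) throws InterruptedException {
            String host = URI.create(url).getHost();
            long wait = lastRequest.getOrDefault(host, 0L) + delayMillis - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastRequest.put(host, System.currentTimeMillis());
        }
    }

A crawler would, for example, construct new PolitenessPolicy(1000) and call waitForHost(url) before every download; the one-second delay is an assumed value.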

The method proposes an innovative, light-weight solution for crawling dynamically generated pages, based on evaluating the influence of URL parameters on the generated HTML content. The parameter evaluation process sequentially compares two pages downloaded with different combinations of URL parameters and computes the similarity ratio of their content. The evaluation records only the useful parameters (those whose similarity ratio exceeds a specified threshold) and excludes the remaining parameters from further processing, which speeds up the crawling process and avoids ambiguous pages. The crawling method also provides a focused crawling feature, based on estimating the relevance of downloaded document content with the ERID tool. Web documents whose relevance estimate passes the threshold are stored for further processing.
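As an illustration of the parameter evaluation idea: the concrete similarity measure and threshold used by WebCrawler are not specified here, so Jaccard similarity over word tokens and a 0.9 threshold are assumed stand-ins.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class ParameterEvaluation {

        /** Similarity ratio of two HTML documents as Jaccard similarity of their word sets (0..1). */
        static double similarityRatio(String htmlA, String htmlB) {
            Set<String> a = tokens(htmlA);
            Set<String> b = tokens(htmlB);
            Set<String> common = new HashSet<>(a);
            common.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) common.size() / union.size();
        }

        /** Strips HTML tags and splits the remaining text into lower-case word tokens. */
        static Set<String> tokens(String html) {
            String text = html.replaceAll("<[^>]*>", " ").toLowerCase();
            return new HashSet<>(Arrays.asList(text.trim().split("\\W+")));
        }

        public static void main(String[] args) {
            // Two pages downloaded with different values of the same URL parameter.
            String pageA = "<html><body>Product list sorted by date</body></html>";
            String pageB = "<html><body>Product list sorted by name</body></html>";
            double ratio = similarityRatio(pageA, pageB);
            double threshold = 0.9;   // assumed value; the tool's actual threshold is not given here
            System.out.println("similarity = " + ratio + ", parameter recorded: " + (ratio >= threshold));
        }
    }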

Figure 2. Block schema of the WebCrawler tool.

References

  1. Gatial E., Balogh Z., Laclavik M., Ciglan M., Hluchy L.: Focused Web Crawling Mechanism based on Page Relevance. In: Proceedings of ITAT 2005 Information Technologies - Applications and Theory, Peter Vojtas (Ed.), Prirodovedecka fakulta Univerzity Pavla Jozefa Safarika v Kosiciach, Slovakia, September 2005, pp. 41-46. ISBN 80-7097-609-8.
  2. Gatial E., Balogh Z.: Identifying, Retrieving and Determining Relevance of Heterogeneous Internet Resources. In: Tools for Acquisition, Organisation and Presenting of Information and Knowledge, P. Navrat et al. (Eds.), Vydavatelstvo STU, Bratislava, 2006, pp. 15-21. ISBN 80-227-2468-8. Workshop 26-28 September, Nizke Tatry, Slovakia.