A              WebCrawler

A.1          Basic Information

The Internet contains a large amount of information spread all over the world, stored mainly in HTML documents. Such documents differ in content structure and topic. The WebCrawler tool traverses the Internet by following the hypertext links found in web documents and stores copies of the documents in a local cache storage. Moreover, it creates copies only of those web documents that pass an estimation of their content topic.

A.1.1      Basic Terms

URL – Uniform Resource Locator

HTML – HyperText Markup Language

A.1.2      Method Description

The method of web crawling, or web spidering, usually employs automatic computer programs called bots or software agents. The method requires a set of starting URL addresses and an ending condition. It is based on downloading a page from the Internet, identifying hypertext links to other pages, and repeatedly processing the unprocessed links. During the crawling process, records about processed and unprocessed links are kept. Copies of the downloaded pages are stored temporarily for further processing.

Several approaches to page downloading were identified; they alter the behavior of the web crawler:

-        Selection policy: Specifies the search strategy for unprocessed links. The most frequently used strategies are breadth-first search, back-link count evaluation, and the PageRank algorithm. The developed algorithm uses breadth-first search (a minimal sketch of the crawling loop is given after this list).

-        Revisit policy: A basic feature of the Internet is its dynamicity; it is therefore appropriate to set a time interval between repeated processing of the same link for updating purposes.

-        Politeness policy: Automated crawling agents are capable of requesting pages at a very high rate; the politeness policy therefore avoids a bottleneck effect on the crawled server.

-        Parallel processing policy: Multi-threaded page processing for large-scale Internet crawling.
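
The following sketch illustrates the basic crawling loop with a breadth-first frontier, a page limit and a simple politeness delay. It is a simplified illustration of the method and the policies above, not the tool's actual implementation; the class name and the regular-expression link extraction are illustrative assumptions only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Illustrative breadth-first crawling loop with a fixed politeness delay.
 * This is a simplified sketch, not the tool's MasterNode/CrawlerNode code.
 */
public class SimpleCrawlLoop {

    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(String startUrl, int pageLimit, long delayMs) throws Exception {
        Queue<String> unprocessed = new LinkedList<String>();   // breadth-first frontier of unprocessed links
        Set<String> processed = new HashSet<String>();          // records of processed links
        unprocessed.add(startUrl);

        while (!unprocessed.isEmpty() && processed.size() < pageLimit) {
            String url = unprocessed.poll();
            if (!processed.add(url)) {
                continue;                                        // already processed
            }
            String page = download(url);                         // a copy would be stored for further processing
            Matcher m = HREF.matcher(page);
            while (m.find()) {                                   // identify hypertext links to other pages
                String link = m.group(1);
                if (!processed.contains(link)) {
                    unprocessed.add(link);
                }
            }
            Thread.sleep(delayMs);                               // politeness delay between requests
        }
    }

    private static String download(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        for (String line; (line = in.readLine()) != null; ) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }
}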

A.1.3      Scenarios of Use

The WebCrawler tool can be used for retrieving HTML pages from a specific site.

A.1.4      External Links and Publications

A.2          Integration Manual

The WebCrawler tool is developed in the Java language and distributed as a jar archive.

A.2.1      Dependencies

-        Corporate Memory libraries: the file and database libraries developed within the scope of the NAZOU project.

-        ERID tool: developed within the scope of the NAZOU project.

-        Log4j library (http://logging.apache.org/log4j/, version 1.2.8+)

-        HtmlParser library (http://sourceforge.net/projects/htmlparser, version 2.0)

A.2.2      Installation

Installation requires the Apache Ant tool and Java SE 1.5 to be installed. The installation procedure consists of the following steps:

  1. Download the WebCrawler source code.

  2. Run the shell command ant jar to build the jar file.

A.2.3      Configuration

The WebCrawler tool uses a common configuration file, named webcrawler.properties, for all its components. This file must be located in the root directory or on the Java resource path (the exact location depends on the class loader used; usually it is the root path of the jar archive or the build directory).

The particular properties of webcrawler.properties are also described in the file itself.

 

webcrawler.properties - property description

cache.path - Full path of the temporary file storage
metadata.url - URL specifying the metadata storage
analysis.url - URL specifying the analyzed data storage
max.concurrent.host - Maximum number of concurrently processed hosts (default: 1)
max.threads.per.host - Maximum number of processing threads per host (default: 1)
add.new.host.regex - Regular expression for adding new hosts into processing (default: *)
min_delay - Minimal delay in milliseconds before downloading the next page, per processing thread (default: 500)
repository.class - Repository class used for storing data and metadata (default: nazou.da.webcrawler.FileRepository)
stop.page.counter - Limit on the number of processed web pages
master.wait - MasterNode refresh time in milliseconds
importance.threshold - Threshold used by the LinkAnalyzer class for parameter importance evaluation

 

Step-by-step configuration and run for a local file system

  1. Parameter setup:

§  Set the parameter cache.path to a valid path on the local file system. The specified path will be used to temporarily store downloaded pages.

§  Set valid values for the parameters metadata.url and analysis.url. These parameters should be prefixed with "file://" when the data are stored in files. The files must exist and be accessible.

§  Set the parameter repository.class to nazou.da.webcrawler.FileRepository or leave it commented out.

§  Set the other parameters according to your preference (an example configuration is shown after these steps).

  2. Run the shell command ant run to run the tool.
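
A minimal example of webcrawler.properties for the local file system setup is shown below. All paths and values are illustrative only; in particular, the values for stop.page.counter, master.wait and importance.threshold are arbitrary examples and must be adapted to the local environment.

# example webcrawler.properties for a local file system setup (illustrative values)
cache.path=/var/tmp/webcrawler/cache
metadata.url=file:///var/tmp/webcrawler/metadata.dat
analysis.url=file:///var/tmp/webcrawler/analysis.dat
max.concurrent.host=1
max.threads.per.host=1
add.new.host.regex=*
min_delay=500
repository.class=nazou.da.webcrawler.FileRepository
stop.page.counter=1000
master.wait=1000
importance.threshold=0.5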

Step-by-step configuration and run for Corporate Memory system

ToDo

A.2.4      Integration Guide

The tool implements a system for automated Web page downloading; its behavior therefore cannot be affected during page processing by direct method calls from Java. This is the reason why WebCrawler does not expose any Java interface for on-line tool management. The outputs of the tool are accessed through the shared database and file space where the files and metadata are available.

For the purpose of configurable storage, WebCrawler exposes the CrawlerRepository Java interface (Figure 1), which allows the implementation of the data storage to be customized. The configuration module loads properties from the common property file, which allows the creation of pre-configured objects. A class must implement the Configurable Java interface (Figure 1) to be supported by the configuration module.

Figure 1. WebCrawler exposed Java interfaces.

Integration of the WebCrawler tool consists of providing the set of URLs that have to be processed and acquiring the crawler output from the data storage. The type of storage and the structure of the information depend on the implementation of the CrawlerRepository interface. There are two implementations: the FileRepository class implements pure file storage, and the CMRepository class implements the Corporate Memory storage interfaces (designated for the Corporate Memory integration method). Another possible way of integrating the WebCrawler output is to implement or extend the CrawlerRepository interface to provide fully customized output.
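
As an illustration of the last integration path, the following sketch outlines a custom repository. Since the actual operations are defined by the CrawlerRepository and Configurable interfaces in Figure 1, the method names used here (configure, storePage, storeMetadata) are assumptions for illustration only.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/*
 * Hypothetical custom repository. The real CrawlerRepository and Configurable
 * interfaces (Figure 1) define the actual method signatures; the ones below
 * are illustrative assumptions only.
 */
public class CustomRepository /* implements CrawlerRepository, Configurable */ {

    private Path basePath;

    /* Assumed configuration hook: receive properties loaded from
       webcrawler.properties by the configuration module. */
    public void configure(Properties props) {
        basePath = Paths.get(props.getProperty("cache.path", "/tmp/webcrawler"));
    }

    /* Assumed storage operation: persist the downloaded page content. */
    public void storePage(URL url, InputStream content) throws IOException {
        Path target = basePath.resolve(Integer.toHexString(url.toString().hashCode()) + ".html");
        Files.copy(content, target);
    }

    /* Assumed storage operation: persist metadata about the page. */
    public void storeMetadata(URL url, String metadata) throws IOException {
        Path target = basePath.resolve(Integer.toHexString(url.toString().hashCode()) + ".meta");
        Files.write(target, metadata.getBytes());
    }
}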

A.3          Development Manual

A.3.1      Tool Structure

The WebCrawler tool implements the following classes and interfaces:

WebCrawler tool structure (package nazou.da.webcrawler, in alphabetical order)

CMRepository - Implementation of the CrawlerRepository interface using the Corporate Memory interfaces.
Configurable (interface) - A class implementing this interface can be pre-configured using the Configuration class.
Configuration - Allows the creation of class instances with attributes configured according to a given property file.
CrawlerNode - Worker crawling node responsible for page downloading and link processing. Multiple instances of this class usually run as independent threads.
CrawlerRepository (interface) - Exposes a common interface for implementations of the data and metadata repository.
FileRepository - Implementation of the CrawlerRepository interface using the local file system as data and metadata storage.
LinkAnalyzer - Analyzes and evaluates URL parameter importance using the tag comparison method.
MasterNode - Core implementation of the crawling methods. The master node exposes getter/setter methods for URL processing and manages CrawlerNode threads. The MasterNode class runs within its own thread.

The basic tool dependency diagram is depicted in Figure 2.

Figure 2.  Dependency diagram of CrawlerNode.

A.3.2      Method Implementation

The WebCrawler tool employs the crawling methods described in section A.1.2 Method Description for downloading web pages and processing their content.

Properties of the MasterNode class and the Analyzer class are configured using the Configuration class, which employs Java reflection to pre-configure newly created objects.
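
A minimal sketch of such reflection-based pre-configuration is given below; the setter-naming and type-conversion rules are assumptions for illustration and may differ from those of the actual Configuration class.

import java.lang.reflect.Method;
import java.util.Enumeration;
import java.util.Properties;

/*
 * Minimal sketch of reflection-based pre-configuration: for each property,
 * look for a matching one-argument setter and convert the string value to
 * the parameter type. The naming and conversion rules of the tool's real
 * Configuration class are assumptions here and may differ.
 */
public class ReflectiveConfigurator {

    public static void configure(Object target, Properties props) throws Exception {
        for (Enumeration<?> e = props.propertyNames(); e.hasMoreElements(); ) {
            String key = (String) e.nextElement();
            // assumed convention: property "min_delay" maps to setter "setMin_delay";
            // keys containing dots would need an additional mapping rule
            String setterName = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            for (Method m : target.getClass().getMethods()) {
                if (m.getName().equals(setterName) && m.getParameterTypes().length == 1) {
                    Class<?> type = m.getParameterTypes()[0];
                    String value = props.getProperty(key);
                    if (type == String.class) {
                        m.invoke(target, value);
                    } else if (type == int.class || type == Integer.class) {
                        m.invoke(target, Integer.valueOf(value));
                    } else if (type == long.class || type == Long.class) {
                        m.invoke(target, Long.valueOf(value));
                    }
                    break;
                }
            }
        }
    }
}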

The core crawling method is implemented in the MasterNode class, which is responsible for URL link storage and CrawlerNode thread management. This class implements the basic crawling policies (see section A.1.2 Method Description), which are easily configured using the common property file.

Parallel downloading and URL link extraction are provided by the CrawlerNode class, whose instances run as separate threads. The number of running threads is dynamically managed by MasterNode.

Extracted URL links are processed by the LinkAnalyzer class, which analyzes the parameter importance of unrecognized links or reconstructs recognized URLs using only the important parameters. The importance of a parameter is evaluated by comparing two HTML documents acquired by HTTP requests that differ in that parameter. The importance ratio is computed as the number of differing tags divided by the total number of tags, and it must exceed the specified threshold for the parameter to be considered important.
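
The following sketch illustrates this ratio computation, assuming the tag sequences of the two compared documents have already been extracted (for example with the HtmlParser library); the way the actual LinkAnalyzer aligns and counts tags may differ.

import java.util.List;

/*
 * Illustrative computation of the importance ratio described above:
 * number of differing tags divided by the total number of tags.
 * The alignment and counting used by the real LinkAnalyzer may differ.
 */
public class ImportanceRatio {

    public static double ratio(List<String> tagsA, List<String> tagsB) {
        int total = Math.max(tagsA.size(), tagsB.size());
        if (total == 0) {
            return 0.0;
        }
        int different = 0;
        for (int i = 0; i < total; i++) {
            String a = i < tagsA.size() ? tagsA.get(i) : null;
            String b = i < tagsB.size() ? tagsB.get(i) : null;
            if (a == null || !a.equals(b)) {
                different++;                      // tag missing or not matching
            }
        }
        return (double) different / total;
    }

    /* A URL parameter is considered important when the ratio exceeds
       the configured importance.threshold value. */
    public static boolean isImportant(List<String> tagsA, List<String> tagsB, double threshold) {
        return ratio(tagsA, tagsB) > threshold;
    }
}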

A.3.3      Enhancements and Optimizing

WebCrawler optimizations can be done in several domains:

Multi-server processing

A shared, synchronized directory would have to be implemented in MasterNode. This optimization is relevant for massive Internet crawling systems.

Improved selection algorithm

WebCrawler uses a very simple method for searching Internet links. A different data structure and search engine must be built to implement more sophisticated search algorithms (e.g., based on back-link count or PageRank).

A.4          Manual for Adaptation to Other Domains

The WebCrawler tool does not require any implementation or configuration steps for adaptation to other domains.