A              OfferMaintenance – Maintenance of Web Page Acquisition

A.1          Basic Information

OfferMaintenance addresses the need for a simple user interface that gives easy access to downloaded web documents and allows the crawling and filtering tools to be managed. Because these tools are run from a command-line interface, managing file and database storage together with the crawling tools by hand is inefficient. The basic functionality of OfferMaintenance lies in the cooperation of the ERID tool, the WebCrawler tool and the Corporate Memory (CM), all developed within the scope of the NAZOU project. OfferMaintenance runs as a separate application on top of these tools, using XML-RPC to connect to the WebCrawler tool and the CM API to connect to file and database storage.

A.1.1      Basic Terms

HTML – HyperText Markup Language

TLU – Threshold Logic Unit

CM – Corporate Memory

A.1.2      Method Description

The OfferMaintenance tool provides a user interface that makes the WebCrawler, ERID and CM tools more efficient and easier to use. The tool does not introduce any novel maintenance methods; it only builds application logic on top of these tools to enable:

A.1.3      Scenarios of Use

The usage scenarios of the OfferMaintenance tool are straightforward and derive from the functionality of the WebCrawler and ERID tools. In addition, the tool provides a simple user interface for managing crawled Web pages. Three basic usage scenarios were identified:

  1. WebCrawler Management

Requirements: The WebCrawler tool is installed and running on a server (the server must be resolvable by DNS or have a static IP address assigned). The WebCrawler tool accepts commands from a remote computer as XML-RPC messages (the firewall must be set up to pass messages from the client host and port). The OfferMaintenance tool is installed on the remote computer. CM is set up correctly.

Use Case: The administrator of CM wants to start the crawling process.

The admin opens the OfferMaintenance user interface in a browser window and checks the host and port of the WebCrawler server in the “WebCrawler” tab. The status of WebCrawler is displayed. The admin can then start or stop the crawling process.
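The start and stop commands in this scenario travel to the WebCrawler server as XML-RPC messages. The following is a minimal sketch of what such a request body looks like, built with the Java standard library only. The method name `crawler.start` and the class and helper names are hypothetical; the source does not list the methods that WebCrawler actually exposes.

```java
public class XmlRpcRequestSketch {

    /** Escapes the characters that must not appear raw inside XML text. */
    public static String escape(String text) {
        // '&' is replaced first so that the entities added below are not re-escaped.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds an XML-RPC methodCall body whose parameters are all strings. */
    public static String buildCall(String methodName, String... params) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\"?>");
        xml.append("<methodCall>");
        xml.append("<methodName>").append(escape(methodName)).append("</methodName>");
        xml.append("<params>");
        for (String p : params) {
            xml.append("<param><value><string>")
               .append(escape(p))
               .append("</string></value></param>");
        }
        xml.append("</params>");
        xml.append("</methodCall>");
        return xml.toString();
    }

    public static void main(String[] args) {
        // "crawler.start" is an assumed method name used for illustration only.
        System.out.println(buildCall("crawler.start"));
    }
}
```

The resulting body would be sent with an HTTP POST to the host and port shown in the “WebCrawler” tab. In practice an XML-RPC client library would normally build and send the request.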

  2. ERID Configuration

Requirements: The WebCrawler tool is installed and running on a server (the server must be resolvable by DNS or have a static IP address assigned). The WebCrawler tool accepts commands from a remote computer as XML-RPC messages (the firewall must be set up to pass messages from the client host and port). The OfferMaintenance tool is installed on the remote computer. CM is set up correctly. WebCrawler uses ERID to filter acceptable pages.

Use Case: The administrator of CM wants to train ERID for more precise filtering.

The admin opens the OfferMaintenance user interface in a browser window and checks the host and port of the WebCrawler server in the “WebCrawler” tab. The ERID tool must be in use. The admin reviews the downloaded pages and marks them as acceptable (pages with suitable content) or unacceptable. The admin can list the words that influence ERID filtering and decide which words are specific to the desired domain and which are not. Next, the admin can train the ERID tool and review the results.
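The source does not describe how ERID works internally. Prompted only by the glossary entry for TLU, the sketch below shows how a threshold logic unit over page words could learn from the admin's acceptable and unacceptable labels. Everything in it (class name, word lists, threshold, learning rate, update rule) is an assumption for illustration and not ERID's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class TluFilterSketch {
    // Per-word weights; words never seen in training contribute nothing.
    private final Map<String, Double> weights = new HashMap<String, Double>();
    private double threshold;
    private final double learningRate;

    public TluFilterSketch(double threshold, double learningRate) {
        this.threshold = threshold;
        this.learningRate = learningRate;
    }

    /** Sum of the weights of the given words. */
    public double activation(String[] words) {
        double sum = 0.0;
        for (String w : words) {
            Double weight = weights.get(w);
            if (weight != null) {
                sum += weight;
            }
        }
        return sum;
    }

    /** The unit fires (page accepted) when the activation reaches the threshold. */
    public boolean accepts(String[] words) {
        return activation(words) >= threshold;
    }

    /** One perceptron-style correction from an admin label; no change if already correct. */
    public void train(String[] words, boolean acceptable) {
        if (accepts(words) == acceptable) {
            return;
        }
        double delta = acceptable ? learningRate : -learningRate;
        for (String w : words) {
            Double old = weights.get(w);
            weights.put(w, (old == null ? 0.0 : old) + delta);
        }
        // Lowering the threshold makes acceptance easier, raising it makes it harder.
        threshold -= delta;
    }

    public static void main(String[] args) {
        TluFilterSketch filter = new TluFilterSketch(0.5, 0.1);
        String[] page = {"offer", "salary"}; // hypothetical domain words
        for (int i = 0; i < 10; i++) {
            filter.train(page, true);
        }
        System.out.println(filter.accepts(page));
    }
}
```

In this reading, the words the admin lists as influencing the filter correspond to the entries of the weight map, and retraining corresponds to repeated calls to `train` with the reviewed pages.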

  3. Crawled Web Page Manipulation

Requirements: The OfferMaintenance tool is installed on the remote computer. CM is set up correctly.

Use Case: The administrator of CM wants to review crawled pages.

The admin opens the OfferMaintenance user interface in a browser window. In the “Maintenance” tab, the admin can review the raw content of crawled HTML pages and, based on that content, delete one or more pages. The admin can also list pages sorted by creation date and decide whether they are still up to date.
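Listing pages by creation date and picking out stale ones can be sketched as follows. The `CrawledPage` type, its fields and the cutoff logic are hypothetical stand-ins; the actual page records come from the CM API, which the source does not detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

public class CrawledPageReviewSketch {

    /** Hypothetical record of a crawled page; not the real CM type. */
    public static class CrawledPage {
        public final String url;
        public final Date created;

        public CrawledPage(String url, Date created) {
            this.url = url;
            this.created = created;
        }
    }

    /** Returns a copy of the list sorted by creation date, oldest first. */
    public static List<CrawledPage> sortByCreation(List<CrawledPage> pages) {
        List<CrawledPage> sorted = new ArrayList<CrawledPage>(pages);
        Collections.sort(sorted, new Comparator<CrawledPage>() {
            public int compare(CrawledPage a, CrawledPage b) {
                return a.created.compareTo(b.created);
            }
        });
        return sorted;
    }

    /** Returns the pages created strictly before the cutoff (deletion candidates). */
    public static List<CrawledPage> olderThan(List<CrawledPage> pages, Date cutoff) {
        List<CrawledPage> stale = new ArrayList<CrawledPage>();
        for (CrawledPage page : pages) {
            if (page.created.before(cutoff)) {
                stale.add(page);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<CrawledPage> pages = new ArrayList<CrawledPage>();
        pages.add(new CrawledPage("http://example.org/b", new Date(2000L)));
        pages.add(new CrawledPage("http://example.org/a", new Date(1000L)));
        for (CrawledPage page : sortByCreation(pages)) {
            System.out.println(page.created.getTime() + " " + page.url);
        }
    }
}
```

The input list is copied before sorting so that the order shown in the user interface can change without altering the underlying collection.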

A.1.4      External Links and Publications

A.2          Integration Manual

The OfferMaintenance tool is developed in Java SE 1.5 and distributed as a JAR package. The tool is accessed from a browser.

A.2.1      Dependencies

OfferMaintenance uses the following external libraries:

-        Corporate Memory libraries: developed within the scope of NAZOU project.

-        ERID library: developed within the scope of NAZOU project.

-        WebCrawler library: developed within the scope of NAZOU project.

-        Log4j library (http://logging.apache.org/log4j/, version 1.2.8+)

-        GWT libraries: Google Web Toolkit libraries (http://code.google.com/webtoolkit/download.html)

-        JUnit library (http://sourceforge.net/projects/junit/, version 4.3+)

A.2.2      Installation

Installation requires the Apache Ant tool, Java SE 1.5 and a web/application server (tested on Tomcat 5.5.20) to be installed. The installation procedure consists of the following steps:

  1. Download the OfferMaintenance source code.
  2. Set up the OfferMaintenance configuration file build.properties.
  3. Run the shell command ant deploy to install the OfferMaintenance tool.

A.2.3      Configuration

The tool is configured during installation; see section “A.2.2 Installation”.

A.2.4      Integration Guide

Since the purpose of the tool is to manage the ERID, WebCrawler and CM tools, extending its functionality at the API level is neither convenient nor reasonable.

As a Web application, the tool can be easily integrated using static Web pages, servlet or portlet technologies. In this case, the size of the Web application window must be considered before integrating it into another user interface. Integration is done by adopting the script located in the “Maintain.html” page.

Development Manual

A.2.5      Tool Structure

The OfferMaintenance tool is composed mainly of the following classes:

CrawlerServiceImpl – handles communication with WebCrawler via the XML-RPC protocol and with the client browser application.

MaintainServiceImpl – handles communication with CM, the ERID tool and the client browser application. This class implements part of the application logic on the server machine.

Figure 3: OfferMaintenance tool block schema.

A.2.6      Method Implementation

The maintenance methods consist of procedures that process the outputs of WebCrawler, ERID and CM methods to form the application logic. The user interface is developed using the GWT (Google Web Toolkit) libraries, which move part of the application logic to the client browser instead of loading the server’s processor(s).

Figure 4: Dependency diagram for OfferMaintenance tool. Modules serve for web server and browser communication and application logic integration.

A.2.7      Enhancements and Optimizing

Not an issue.

A.3          Manual for Adaptation to Other Domains

A.3.1      Configuring to Other Domain

The OfferMaintenance tool can provide maintenance actions over Web pages from a different domain once the ERID tool has been trained for that domain and applied in the crawling process. No adaptation of OfferMaintenance itself is needed to use it in other domains.