The tool provides a user interface for managing page acquisition and validation.

Institution: Institute of Informatics
Technologies used: Java, MySQL, JSP
Inputs: Page metadata stored in CM, copies of crawled pages
Outputs: Modifications of crawled page metadata, removal of page copies
Documentation: HTML, doc
Distribution packages: zip

Addressed Problems

The WebCrawler tool, together with ERID, can produce a huge number of raw web page copies, linked by metadata stored in the database part of the CorporativeMemory (CM). Because focused crawling relies on a suboptimal ERID configuration to download domain-specific pages, a human has to supervise the process. The proposed tool provides a user interface that lets the maintainer review specific downloaded pages and perform operations on them: delete or recategorize downloaded pages, invoke the ERID learning process on a specific page set, or initiate an update of specific pages.


The OfferMaintenance tool lists page metadata according to specific criteria (date, URL regexp filter) and allows selected pages to be deleted. If some downloaded pages do not contain the requested content, they can be marked as 'unwanted', and the learning algorithm of the ERID tool can then be invoked to avoid crawling such pages in the future. The ERID tool settings are modified and archived. Moreover, the tool can warn the NAZOU system administrator about possibly expired content of crawled pages, based on the page date (or the page META tag 'EXPIRES').
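
The filtering and expiry check described above can be illustrated with a minimal Java sketch. It is not taken from the OfferMaintenance sources: the class OfferMaintenanceSketch, the PageMeta fields, and the methods filter and possiblyExpired are hypothetical names, and the sketch assumes that the 'EXPIRES' value uses the RFC 1123 date format and that a maximum age is applied when no usable value is present. In the real tool the metadata would come from the CM database rather than from an in-memory list.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the OfferMaintenance code base.
class OfferMaintenanceSketch {

    // Minimal stand-in for one row of page metadata stored in the CM (assumed fields).
    static final class PageMeta {
        final String url;
        final Instant downloaded;   // when the crawler stored the copy
        final String expiresValue;  // raw 'EXPIRES' META value, may be null

        PageMeta(String url, Instant downloaded, String expiresValue) {
            this.url = url;
            this.downloaded = downloaded;
            this.expiresValue = expiresValue;
        }
    }

    // Keep pages downloaded within [from, to) whose URL contains a match for urlRegex.
    static List<PageMeta> filter(List<PageMeta> pages, String urlRegex,
                                 Instant from, Instant to) {
        Pattern pattern = Pattern.compile(urlRegex);
        List<PageMeta> selected = new ArrayList<>();
        for (PageMeta page : pages) {
            boolean inRange = !page.downloaded.isBefore(from)
                    && page.downloaded.isBefore(to);
            if (inRange && pattern.matcher(page.url).find()) {
                selected.add(page);
            }
        }
        return selected;
    }

    // A page counts as possibly expired if its EXPIRES value parses and lies
    // before 'now'; without a usable value, the age of the copy decides (assumption).
    static boolean possiblyExpired(PageMeta page, Instant now, Duration maxAge) {
        if (page.expiresValue != null) {
            try {
                Instant expires = ZonedDateTime
                        .parse(page.expiresValue, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                return expires.isBefore(now);
            } catch (DateTimeParseException e) {
                // Unparseable value: fall back to the age-based rule below.
            }
        }
        return page.downloaded.plus(maxAge).isBefore(now);
    }

    public static void main(String[] args) {
        List<PageMeta> pages = new ArrayList<>();
        pages.add(new PageMeta("http://example.org/jobs/1",
                Instant.parse("2008-05-01T00:00:00Z"), "Tue, 3 Jun 2008 11:05:30 GMT"));
        pages.add(new PageMeta("http://example.org/news/2",
                Instant.parse("2008-05-02T00:00:00Z"), null));

        Instant now = Instant.parse("2008-07-01T00:00:00Z");
        for (PageMeta page : filter(pages, "/jobs/",
                Instant.parse("2008-01-01T00:00:00Z"), now)) {
            System.out.println(page.url + " possibly expired: "
                    + possiblyExpired(page, now, Duration.ofDays(30)));
        }
    }
}
```

Pages selected this way could then be offered to the maintainer for deletion or for marking as 'unwanted'; how those operations reach the CM and ERID is outside the scope of this sketch.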

Block schema of offer maintenance tool
