Wrapper

A web page wrapping system that takes the contents of semi-structured web pages as its input and generates structured output of the extracted data (represented as XML, a relational database, or an ontology).

Institution: Slovak University of Technology
Technologies used: Java, JRex, Weka, Xalan, Jaxen
Inputs: accessible Web resources
Outputs: structured output to database, XML file, Sesame ontology
Documentation: HTML, doc, JavaDoc

Addressed Problems

Intelligent information retrieval using automated tools is essential for extracting and keeping track of the changing contents on the Web. We aim to develop:

Description

Web page wrapping is performed by a wrapper program, which is developed using a wrapper designer. A wrapping language has been designed for this purpose. It has a tree-like structure in which actions are chained together to form a continuous control flow.
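
The tree-like structure with chained actions can be illustrated by a minimal Java sketch. This is not the actual wrapping language: the names ActionNode, then, run and WrappingLanguageSketch are hypothetical, and the "open" action returns a fixed HTML string instead of fetching a page, so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical action node: one step of a wrapper program plus its child actions. */
final class ActionNode {
    private final String name;
    private final Function<String, String> step;
    private final List<ActionNode> children = new ArrayList<>();

    ActionNode(String name, Function<String, String> step) {
        this.name = name;
        this.step = step;
    }

    /** Attaches a child action and returns this node, so calls can be chained. */
    ActionNode then(ActionNode child) {
        children.add(child);
        return this;
    }

    /** Runs this action, records the result, then feeds the output to each child in order. */
    void run(String input, List<String> trace) {
        String output = step.apply(input);
        trace.add(name + " -> " + output);
        for (ActionNode child : children) {
            child.run(output, trace);
        }
    }
}

final class WrappingLanguageSketch {
    /** Builds a three-action chain (open, extract, store) and returns its execution trace. */
    static List<String> demo() {
        ActionNode root = new ActionNode("open", url -> "<h1>Offer</h1>") // assumption: stands in for a page fetch
            .then(new ActionNode("extract", html -> html.replaceAll("<[^>]+>", "")) // strips tags
                .then(new ActionNode("store", text -> "stored:" + text)));
        List<String> trace = new ArrayList<>();
        root.run("http://example.org/", trace);
        return trace;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because each node passes its output to its children, a node with several children would branch the control flow, while a single-child chain like the one above behaves as a linear pipeline.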

We emphasize the following aspects of web page wrapping:

  1. navigation in the web hyperspace, with respect to the common client-server communication issues such as cookie and session handling, user authentication via HTML forms, etc.;
  2. extraction of desired patterns from web pages using machine learning techniques; the user teaches the wrapper designer by giving examples of the desired document parts.
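
The session handling mentioned in the first aspect can be sketched with the standard java.net.CookieManager. This is an illustration under assumptions rather than the system's own mechanism (which builds on the Mozilla-based JRex framework): the class name SessionSketch, the cookie SESSIONID=abc123 and the example.org URLs are made up, and the server response is simulated so that no network access is needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

final class SessionSketch {
    /**
     * Simulates a login response that sets a session cookie, then returns the
     * Cookie header that would accompany a later request to the same site.
     */
    static String demo() {
        try {
            CookieManager cookies = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

            // Assumption: headers a server might send after a successful form login.
            URI login = URI.create("http://example.org/login");
            cookies.put(login, Map.of("Set-Cookie", List.of("SESSIONID=abc123; Path=/")));

            // A follow-up request to another page on the same host reuses the session.
            URI next = URI.create("http://example.org/offers");
            Map<String, List<String>> headers = cookies.get(next, Map.of());
            return String.join("; ", headers.getOrDefault("Cookie", List.of()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Cookie: " + demo());
    }
}
```

A wrapper that navigates through several pages would keep one such cookie store for the whole run, so that pages behind a login remain reachable after the authentication step.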

Wrapper designer architecture.

The document is presented using the JRex Mozilla-based framework. Teaching of document patterns is an iterative process:

  1. open the page you wish to wrap
  2. click on the part of the document which is an instance of the pattern you wish to wrap
  3. see how the wrapper designer generalized from your example
  4. if you are not satisfied with the result, return to step 2; otherwise you are finished
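
The loop above can be illustrated with a deliberately simple generalization rule. The sketch below is not the learning technique used by the wrapper designer; it assumes that each click yields an XPath-like location of the selected element, and it generalizes by dropping the positional index wherever two examples differ. The class PatternGeneralizer and its methods are hypothetical names.

```java
import java.util.List;

final class PatternGeneralizer {
    /** Removes a trailing positional index, e.g. "tr[2]" becomes "tr". */
    static String stripIndex(String step) {
        return step.replaceAll("\\[\\d+\\]$", "");
    }

    /** Merges two element paths of equal depth into the most specific pattern covering both. */
    static String generalize(String a, String b) {
        String[] left = a.split("/");
        String[] right = b.split("/");
        if (left.length != right.length) {
            throw new IllegalArgumentException("paths differ in depth");
        }
        String[] out = new String[left.length];
        for (int i = 0; i < left.length; i++) {
            if (left[i].equals(right[i])) {
                out[i] = left[i];                  // identical step: keep as is
            } else if (stripIndex(left[i]).equals(stripIndex(right[i]))) {
                out[i] = stripIndex(left[i]);      // same tag, different position: drop the index
            } else {
                out[i] = "*";                      // different tags: match any element
            }
        }
        return String.join("/", out);
    }

    /** Folds all examples given so far into one pattern (one pass per user click). */
    static String learn(List<String> examples) {
        String pattern = examples.get(0);
        for (String example : examples.subList(1, examples.size())) {
            pattern = generalize(pattern, example);
        }
        return pattern;
    }

    public static void main(String[] args) {
        System.out.println(learn(List.of(
            "/html/body/table/tr[2]/td[1]",
            "/html/body/table/tr[5]/td[1]")));
    }
}
```

With the two examples in main, the learned pattern is /html/body/table/tr/td[1], that is, the first cell of every row. Each further click in step 2 would add one more example to the list before the pattern is shown again in step 3.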

References

  1. Peter Sýkora, György Frivolt, Andrej Janzo, and Vojtech Szöcs. Automatized Information Retrieval from Heterogenous Web Sources. In Pavol Návrat, Pavol Bartoš, Mária Bieliková, Ladislav Hluchý, and Peter Vojtáš, editors, Proceedings in Informatics and Information Technologies: Tools Acquisition, Organisation and Presenting of Information and Knowledge, pages 32-39, Bystrá dolina, Nízke Tatry, Slovakia, September 2006. Vydavateľstvo STU.