Wrapper
A web page wrapping system that takes the contents of semi-structured web pages as its input and generates structured output of the extracted data (represented as XML, a relational database, or an ontology).
Institution: | Slovak University of Technology |
Technologies used: | Java, JRex, Weka, Xalan, Jaxen |
Inputs: | accessible Web resources |
Outputs: | structured output to database, XML file, Sesame ontology |
Documentation: | HTML, doc, JavaDoc |
Addressed Problems
Intelligent information retrieval using automated tools is essential
for extracting and keeping track of the changing contents on the Web.
We aim to develop:
- a web page wrapping system that takes the contents of semi-structured web pages as its input and generates structured output of the extracted data (represented as XML, a relational database, or an ontology);
- a system that learns document patterns from user examples.
Description
Web page wrapping is performed by a wrapper program, which is developed using a wrapper designer. A wrapping language has been designed for this purpose. It has a tree-like structure in which actions are chained together to form a continuous control flow.
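The wrapping language itself is not specified here, so the following Java sketch only illustrates the general idea of a tree of actions executed as one continuous control flow. All class and action names (WrapperProgramSketch, Step, Sequence, "open-page", and so on) are hypothetical and are not taken from the actual system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a wrapper program as a tree of actions run depth-first. */
final class WrapperProgramSketch {

    /** Shared state passed along the control flow; here it is only an execution trace. */
    static final class Context {
        final List<String> trace = new ArrayList<>();
    }

    /** A node of the program tree. */
    interface Action {
        void run(Context ctx);
    }

    /** Leaf node: a single named step (the names used below are invented). */
    static final class Step implements Action {
        private final String name;

        Step(String name) {
            this.name = name;
        }

        @Override
        public void run(Context ctx) {
            ctx.trace.add(name);
        }
    }

    /** Inner node: runs its children in order, chaining them into one flow. */
    static final class Sequence implements Action {
        private final String name;
        private final List<Action> children;

        Sequence(String name, Action... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        @Override
        public void run(Context ctx) {
            ctx.trace.add(name + " {");
            for (Action child : children) {
                child.run(ctx);
            }
            ctx.trace.add("}");
        }
    }

    /** Builds a small example tree; the action names are illustrative only. */
    static Action exampleProgram() {
        return new Sequence("wrap-site",
                new Step("open-page"),
                new Sequence("extract-block",
                        new Step("extract-pattern"),
                        new Step("store-record")),
                new Step("close-session"));
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        exampleProgram().run(ctx);
        ctx.trace.forEach(System.out::println);
    }
}
```

Running the root action visits the tree depth-first, so nested sequences are flattened into a single ordered series of steps while the tree keeps the grouping visible to the designer.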
We emphasize the following aspects of web page wrapping:
- navigation in the web hyperspace, taking into account common client-server communication issues such as cookie and session handling, user authentication via HTML forms, etc.;
- extraction of desired patterns from web pages using machine learning techniques; the user teaches the wrapper designer by giving examples of the desired document parts.
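As an illustration of the cookie and session handling mentioned in the first point, the sketch below uses the standard java.net.CookieManager to carry a session cookie from one response to the next request. It is only an assumption about how such handling could look in Java and does not describe the wrapper's actual implementation. The class name, the URLs, and the JSESSIONID value are invented for the example, and no network access takes place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: keeping a session cookie between two requests. */
final class SessionCookieSketch {

    /**
     * Stores the Set-Cookie header of a response received from loginUri and
     * returns the Cookie header values that would accompany a request to nextUri.
     */
    static List<String> cookiesForNextRequest(URI loginUri, String setCookieHeader, URI nextUri) {
        CookieManager manager = new CookieManager();
        try {
            // Remember the cookies sent by the server in its response headers.
            Map<String, List<String>> responseHeaders = Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList(setCookieHeader));
            manager.put(loginUri, responseHeaders);

            // Ask which of the stored cookies apply to the follow-up request.
            Map<String, List<String>> requestHeaders =
                    manager.get(nextUri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = requestHeaders.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URI login = URI.create("http://example.org/login");
        URI next = URI.create("http://example.org/private/data");
        System.out.println(cookiesForNextRequest(login, "JSESSIONID=abc123; Path=/", next));
    }
}
```

A wrapper that logs in through an HTML form would keep one such cookie store for the whole navigation, so that every page fetched after the login belongs to the same session.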

Wrapper designer architecture.
The document is displayed in the wrapper designer using the Mozilla-based JRex framework. Teaching of document patterns is an iterative process:
1. open the page you wish to wrap
2. click on a part of the document that is an instance of the pattern you wish to wrap
3. see how the wrapper designer generalized from your examples
4. if you are not satisfied, continue with step 2; otherwise you are finished
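The teaching loop above can be illustrated with a deliberately simple stand-in for the learning step. The sketch below is not the machine learning technique used by the wrapper designer. It assumes that each clicked element is described by an XPath-like location path, and it generalizes by dropping every position index on which the examples disagree. The class name, method names, and sample paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: generalizing a path pattern from clicked examples. */
final class PatternGeneralizationSketch {

    /** Returns the element name of a step, e.g. "tr" for "tr[2]". */
    static String tag(String step) {
        int bracket = step.indexOf('[');
        return bracket < 0 ? step : step.substring(0, bracket);
    }

    /**
     * Keeps the steps on which all examples agree and drops the position
     * index of every step where they differ. Examples must share the same
     * sequence of element names.
     */
    static String generalize(List<String> examplePaths) {
        String[] pattern = examplePaths.get(0).split("/");
        for (String path : examplePaths.subList(1, examplePaths.size())) {
            String[] steps = path.split("/");
            if (steps.length != pattern.length) {
                throw new IllegalArgumentException("paths differ in depth: " + path);
            }
            for (int i = 0; i < steps.length; i++) {
                if (pattern[i].equals(steps[i])) {
                    continue; // the examples agree on this step
                }
                if (!tag(pattern[i]).equals(tag(steps[i]))) {
                    throw new IllegalArgumentException("paths differ in element names: " + path);
                }
                pattern[i] = tag(steps[i]); // indices differ: match any position
            }
        }
        return String.join("/", pattern);
    }

    /** Tests whether a concrete path is an instance of the generalized pattern. */
    static boolean matches(String pattern, String path) {
        String[] p = pattern.split("/");
        String[] s = path.split("/");
        if (p.length != s.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            boolean exact = p[i].equals(s[i]);
            boolean anyPosition = p[i].indexOf('[') < 0 && p[i].equals(tag(s[i]));
            if (!exact && !anyPosition) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Each loop iteration stands for one click by the user (step 2),
        // followed by showing the current generalization (step 3).
        List<String> clicks = Arrays.asList(
                "/html/body/table/tr[2]/td[1]",
                "/html/body/table/tr[5]/td[1]");
        List<String> examples = new ArrayList<>();
        for (String click : clicks) {
            examples.add(click);
            System.out.println(generalize(examples));
        }
    }
}
```

After the first click the pattern is just the clicked path. After the second click the row index is dropped, so the pattern covers the first cell of every table row. The user would stop adding examples once the pattern selects exactly the desired parts.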
References
- Peter Sýkora, György Frivolt, Andrej Janzo, and Vojtech Szöcs. Automatized Information Retrieval from Heterogenous Web Sources. In Pavol Návrat, Pavol Bartoš, Mária Bieliková, Ladislav Hluchý, and Peter Vojtáš, editors, Proceedings in Informatics and Information Technologies: Tools Acquisition, Organisation and Presenting of Information and Knowledge, pages 32-39, Bystrá dolina, Nízke Tatry, Slovakia, September 2006. Vydavateľstvo STU.