A              Wrapper

A.1          Basic Information

Web page wrapping is the process of retrieving information stored on web pages and capturing it in a structured form. The retrieval consists of identifying the relevant pieces of information on the page and storing them in XML, a database or an ontology. Wrappers differ mainly in their degree of automation. We divide the wrapping approaches into the following categories:

§  Wrapper creators – These systems support the wrapper developer by providing functions for defining the structures to be wrapped from the web page. Programming languages equipped with HTML and string manipulation functions belong to this category. The developer has a wide range of possibilities for defining the document structures to be wrapped, but has to develop the wrapper manually. Manual development is tedious and cumbersome, but it offers a wider range of possibilities than the more automated methods.

§  Wrapper inducers – The wrapper creation process is supported by automatic wrapper induction. The developer generally specifies only positive and negative examples, and the system produces extraction rules[1]. Although wrapper creation is less tedious this way, the wrapper developer loses the fine-grained control over the wrapping operations that wrapper creators offer.

§  Automatic extraction rule generators – Several systems exploit the repetition of patterns in documents. Web pages often contain lists, which consist of similar elements. Such systems can therefore induce knowledge from the web pages even before getting any example from the wrapper developer.

A.1.1      Basic Terms

Document pattern

part of the document fulfilling a condition defined or taught by the developer of the wrapper; a pattern is a coherent part of the document; a pattern can have several instances, which we call subdocuments

Document filter

condition distinguishing the document pattern

Pattern extraction

process of applying filters on the document

Subdocument

an instance of a document pattern

Document representation

documents can be represented as DOM trees (in the case of web pages), as strings or in other ways; different pattern learning strategies can be applied depending on the representation

Learning strategy

method applied for learning the document filter; the learning strategies currently exploit machine learning techniques; the examples necessary for the learning process are provided by the user

Method Description

We applied two methods to the web page wrapping problem. In the first approach we designed and implemented a wrapper definition framework, in which the wrappers are specified by the user. In the second approach we investigated a wrapper induction method. Both approaches are described below.

Wrapper definition framework – The wrapper is defined as a sequence of actions. Actions are elementary executable units (nodes) connected into a sequence. The first action of the wrapper program is the root and defines the starting point of its execution. The connections between individual actions define the control flow of the wrapper program. Actions are divided into the following main categories:

§  Navigation actions are used for navigating the Web and parsing web documents into a DOM structure, for example loading pages and following hyperlinks.

§  Extraction actions are used for data extraction using XPath and regular expressions into structured output objects or context variables.

§  Iterative actions are used for iterative execution of the actions that follow them in the sequence. The following actions are iterated over a list of links on the web page, or over the Next links if the list spans several pages.

Wrapper induction by machine learning – Our second approach focuses on pattern training. The trained wrapper can distinguish the parts of the document that belong to the learnt pattern.

We distinguish several possible document representations. We describe a document as a set of elements. We call these elements subdocuments. The subdocuments have attributes. The document representations can differ in a) the way of partitioning the documents into subdocuments, b) the attributes describing the subdocuments and c) the relation among the subdocuments.

We considered two types of subdocuments so far in our project:

§  XML representation – The subdocument is represented as a W3C Document Object Model (DOM) structure. The elements of the document are addressable by XPath expressions. The learning method operates on the XPath expression of the subdocument.

§  Attributed elements – In this case the subdocuments are also elements of the DOM tree. The attributes of a subdocument are the type (name) of the element, its style class, the depth of the element in the DOM tree and its index (if the element is in a table).
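The attributed-element representation can be illustrated with a minimal sketch that computes the four attributes for one DOM element. The class name ElementAttributes, its fields and the parse helper are hypothetical and not part of the WRAPPER code base. The sketch also assumes that "index" means the position of a td or th cell within its row, and that the input is well-formed XML rather than arbitrary HTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical holder for the four attributes named in the text.
class ElementAttributes {
    final String name;       // element type, e.g. "td"
    final String styleClass; // value of the class attribute, "" if absent
    final int depth;         // 0 for the document element
    final int tableIndex;    // position of a cell in its row, -1 outside tables

    ElementAttributes(String name, String styleClass, int depth, int tableIndex) {
        this.name = name;
        this.styleClass = styleClass;
        this.depth = depth;
        this.tableIndex = tableIndex;
    }

    static ElementAttributes of(Element e) {
        // Depth: number of element ancestors up to the document element.
        int depth = 0;
        for (Node p = e.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            depth++;
        }
        // Index: assumed here to be the cell position within the table row.
        int index = -1;
        if ("td".equals(e.getTagName()) || "th".equals(e.getTagName())) {
            index = 0;
            for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element) {
                    index++;
                }
            }
        }
        return new ElementAttributes(e.getTagName(), e.getAttribute("class"), depth, index);
    }

    // Helper for trying the sketch out on a well-formed XML string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Not well-formed XML", ex);
        }
    }
}
```

For the second cell of a one-row table nested directly in the body element, the sketch yields the name td, the cell's style class, depth 4 and index 1.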

The framework of our application makes it easy to define other types of subdocuments. The wrapper inducer is trained on user examples, which are represented as subdocuments. The result of the training is a generalization of the examples. We impose the following conditions on the generalization:

§  Each positive example must be in the resulting selection.

§  If negative examples are used, no negative example can be in the resulting selection.
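The two conditions can be stated as a simple check on a candidate selection. The following is a minimal sketch; the class and method names are hypothetical, and subdocuments are represented by any type with a sensible equals and hashCode.

```java
import java.util.Set;

class GeneralizationCheck {
    // Returns true when the selection contains every positive example
    // and none of the negative examples.
    static <T> boolean isConsistent(Set<T> selection, Set<T> positives, Set<T> negatives) {
        if (!selection.containsAll(positives)) {
            return false; // a positive example was lost by the generalization
        }
        for (T negative : negatives) {
            if (selection.contains(negative)) {
                return false; // the generalization is too broad
            }
        }
        return true;
    }
}
```

When no negative examples are used, an empty set is passed as the third argument and only the first condition is effective.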

We briefly describe the learning strategies we implemented for generalizing the document pattern.

Simple XPath learning strategy. The strategy generalizes only from positive examples and operates on the XML document representation. The learning process starts with an empty filter. The examples are XPath expressions.
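The text does not spell out the generalization rule, so the following is only one plausible reading, given as a minimal sketch. It assumes that the examples are absolute location paths of equal depth with optional positional predicates. Where two examples disagree on a step, the positional predicate is dropped, or the step is replaced by a wildcard if the element names differ. The class SimpleXPathLearner and its methods are hypothetical.

```java
class SimpleXPathLearner {
    // null represents the empty filter: nothing has been learnt yet.
    private String[] filter;

    void addPositiveExample(String xpath) {
        // Assumes an absolute path such as /html/body/table/tr[1]/td[2].
        String[] steps = xpath.substring(1).split("/");
        if (filter == null) {
            filter = steps; // the first example is taken as it is
            return;
        }
        if (steps.length != filter.length) {
            throw new IllegalArgumentException("This sketch handles equal-depth examples only");
        }
        for (int i = 0; i < steps.length; i++) {
            if (filter[i].equals(steps[i])) {
                continue; // the examples agree on this step
            }
            String learnt = elementName(filter[i]);
            String seen = elementName(steps[i]);
            // Same element, different position: drop the predicate.
            // Different elements: fall back to a wildcard.
            filter[i] = learnt.equals(seen) ? learnt : "*";
        }
    }

    private static String elementName(String step) {
        int bracket = step.indexOf('[');
        return bracket < 0 ? step : step.substring(0, bracket);
    }

    String toXPath() {
        return filter == null ? "" : "/" + String.join("/", filter);
    }
}
```

Given the examples /html/body/table/tr[1]/td[2] and /html/body/table/tr[3]/td[2], the sketch produces /html/body/table/tr/td[2], which selects the second cell of every row.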

Attribute Selection Learning Strategy. We apply machine learning methods to the attributes of the subdocument (listed in the description of the document representation). The strategy is able to learn from positive and negative examples. The “positivity” of an example is used as an additional attribute of the example. If any of the attributes is missing, its value is set to a special value, which is then used in the classification process. So far we have experimented only with the Bayesian classifier.
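To illustrate the idea, the following minimal sketch trains a categorical naive Bayes classifier on attribute vectors and treats a missing attribute as the special value "?". It is a plain Java stand-in written for this illustration and does not use the Weka API on which the tool depends. The class TinyNaiveBayes, the MISSING marker and the smoothing scheme are all assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TinyNaiveBayes {
    static final String MISSING = "?"; // special value for absent attributes

    private final Map<Boolean, Integer> classCounts = new HashMap<>();
    private final Map<String, Integer> valueCounts = new HashMap<>();
    private final Map<Integer, Set<String>> seenValues = new HashMap<>();

    // The positivity of the example is the class attribute.
    void train(String[] attributes, boolean positive) {
        classCounts.merge(positive, 1, Integer::sum);
        for (int i = 0; i < attributes.length; i++) {
            String v = attributes[i] == null ? MISSING : attributes[i];
            valueCounts.merge(positive + "|" + i + "|" + v, 1, Integer::sum);
            seenValues.computeIfAbsent(i, k -> new HashSet<>()).add(v);
        }
    }

    boolean classify(String[] attributes) {
        return score(attributes, true) > score(attributes, false);
    }

    // Log of prior times likelihood, with add-one smoothing.
    private double score(String[] attributes, boolean label) {
        int n = classCounts.getOrDefault(label, 0);
        int total = 0;
        for (int c : classCounts.values()) {
            total += c;
        }
        double logP = Math.log((n + 1.0) / (total + 2.0));
        for (int i = 0; i < attributes.length; i++) {
            String v = attributes[i] == null ? MISSING : attributes[i];
            int count = valueCounts.getOrDefault(label + "|" + i + "|" + v, 0);
            int distinct = seenValues.getOrDefault(i, Collections.<String>emptySet()).size();
            // One extra slot in the denominator reserves mass for unseen values.
            logP += Math.log((count + 1.0) / (n + distinct + 1.0));
        }
        return logP;
    }
}
```

An attribute vector could hold, for instance, the element name and style class of a subdocument; a null entry stands for a missing attribute and takes part in the classification like any other value.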

A.1.2      Scenarios of use

The wrapper tool is suitable for the following scenarios:

§  definition of a wrapper program,

§  execution of a wrapper program and storing the retrieved data to structured form.

The tool is not intended for implementing a wrapper program for one web site and executing it on a different one. It is intended for site wrapping, not domain wrapping.

The following types of WRAPPER users are distinguished:

§  Wrapper developer creates the wrapper either by declaring what it has to do or by giving examples to the wrapper inducer (in the case of WRAPPER v2). Note that the role of the wrapper developer is different from that of the developer of the wrapper tool itself. The development guide appears later in this document.

§  Wrapper executor starts the wrapper implemented by the developer. Once developed, a wrapper should be reusable for many executions. While developing a wrapper might be time-consuming and knowledge-demanding, executing it is a straightforward process.

§  Administrator of wrappers controls the regular execution of several wrappers. Regularity is usually necessary for applications which exploit the data gathered by the wrappers. The administrator should have the knowledge of a wrapper executor and a good knowledge of the administration of servers, databases and ontology repositories.

A.1.3      External Links and Publications

 

A.2          Integration Manual

Wrapper is developed in Java 5 and distributed as a jar file. It uses several packages, which are packed into the jar file. Wrapper is a stand-alone application. Besides the jar file, the configuration files also have to be placed in the directory from which Wrapper is executed. Depending on the output environment, a database or an ontology repository might have to be set up. The details of the installation and configuration are described in the following sections.

A.2.1      Dependencies

Wrapper depends on the following packages: Jaxen (1.1b8)[2], Weka (3.3.6)[3], Xalan (2.7.0)[4], Nekohtml (0.9.5)[5], Log4j (1.2.14)[6], Jrex (1.0b1)[7], Sesame (1.2.6), Openrdf (1.2.6) [8], POI (2.5.1)[9], RIO (1.0.9)[10], Commons-fileupload (1.1)[11], Commons-lang (2.1)[12], Commons-httpclient (3.0)[13], Commons-validator (1.2.0)[14], Commons-logging (1.0.3)[15], Commons-codec (1.2)[16].

We use NALIT (0.3.1) for identification of the language.

A.2.2      Installation

Installation of Wrapper is a straightforward process:

1.    unpack the wrapper.zip file containing the jar file and the configuration files

2.    configure the new Wrapper instance

A.2.3      Configuration

The following configuration files are present in the Wrapper tool:

§  <root>/.properties – property file required by NALIT

§  <root>/conf/log4j.properties – Log4j logging properties

§  <root>/conf/dunning-2-profiles/ directory of NALIT language profiles.

The NALIT and Log4j properties are described in the NALIT and Log4j documentation.

A.2.4      Integration Guide

The WRAPPER tool has two versions, which share the same aim but differ in their philosophy of defining wrappers. WRAPPER v1 was designed for creating wrappers through manual specification of the data pieces to be wrapped by the wrapper developer. WRAPPER v2 is a wrapper inducer, where the user gives examples of the document patterns to be retrieved.

The common parts of the two versions are described first, followed by the wrapper creation process for each version separately.

User interfaces

Two interfaces support the users in their tasks.

§  Wrapper development environment – a GUI tool which provides an environment for wrapper creation, debugging and execution; this is the tool of the wrapper developer

§  Command line wrapper executor – capable of wrapper execution and step-by-step debugging; the tool is intended for wrapper executors and administrators

Wrapper Development Environment

Figure 1 depicts a screenshot of the Wrapper Development Environment. The GUI of the environment consists of the following parts:

§  Menu and toolbar – the toolbar contains the actions the user most typically wishes to execute;

§  List of wrapping actions – a palette from which the user can choose the new action to add under the currently selected action in the wrapper program;

§  Workspace – contains the wrapper project, visualizes the tree of wrapping actions.

Figure 1. Wrapper Development Environment.

Below we describe the workspace and the interpreter of the Wrapper Development Environment. Because WRAPPER v1 and v2 operate with different sets of actions, we give only a generic description of the workspace here. The specific sets of actions are described in the sections devoted to WRAPPER v1 and v2, respectively.

Workspace

The wrapper program has a tree-like structure; in WRAPPER v1, however, only a sequence is used. The nodes of the tree are the actions, and the workspace is the visualization of this tree. The user can manipulate the tree by collapsing and expanding its branches. A pop-up menu (see Figure 2) can be used for editing or removing actions. Every action has specific arguments, and therefore a different dialog handles the change of parameters for each action. The list of actions is discussed later.

Figure 2. Action pop-up menu.

Interpretation of the wrapper

The wrapper program is executed by invoking the Interpreter dialog, which is depicted in Figure 3. This tool is capable of execution, debugging and definition of the target environments. The wrapped structures are stored in output objects. Output objects are tree structures, and the wrapped data is stored in the nodes of the tree.

Figure 3. Interpreter dialog.

Debugging mode can be switched on and off with the “Debug mode” check box. The “Run” button executes the wrapper. The “Continue” button is active only in debugging mode. Breakpoints are not available in the Wrapper Development Environment, but they can be declared in the command line wrapper executor. In debugging mode the context and the current state of the output objects can be viewed.

The target environments are defined in the dialog depicted in Figure 4. The data gathered during the wrapping process is stored in the target environments defined in this dialog. Several types of environments are distinguished:

§  Relational database

§  Sesame ontology repository

§  XML document

§  file

Figure 4. Target environments.

Different target environments have different parameters. The dialog for setting the parameters of the XML document target environment is shown in Figure 5.

Figure 5. XML document target environment.

In debugging mode it is possible to view the state of the context and the output objects. Figure 6 shows the dialog displaying the state of the output objects as an XML document.

Figure 6. State of output object.

Command line wrapper executor

Wrapper executors and administrators do not need a GUI to fulfil their tasks. The command line wrapper executor is a simple tool which executes the wrappers developed in the Wrapper Development Environment. The step-by-step debugging feature may be handy when the wrapper execution fails, for instance because of a change in the wrapped site.

The command line tool has the following parameters:

Parameter    Number of arguments    Description
-w           1                      Obligatory parameter giving the wrapper project file name.
-x           0 or 2                 Defines an XML output file. The first argument is the name of the output file; the second is the name of the root node in the output object. If no arguments are given, the default values output.xml and document are used.
-o           3                      Defines an ontological output object. The arguments are: 1) name of the RDF mapping file; 2) URI; 3) name of the output RDF/XML document.
-a           1                      Specifies a user-defined HTTP client.
-p           1                      Specifies the address and port of the proxy server in the form server:port.
-h           0                      Shows the user help.
-d           0 or more              Turns the wrapper executor to debugging mode. If no argument is given, the wrapper stops at every action; otherwise execution stops only at the actions listed as arguments. In debugging mode it is possible to view the context of the wrapper.

Creation of a wrapper in WRAPPER v1

In WRAPPER v1 the wrapper program is expressed as a simple sequence of actions, not as a tree as in WRAPPER v2. The actions define exact recipes for what has to be wrapped from the web site. WRAPPER v1 distinguishes actions for navigation (LoadPage, FollowLink), extraction (ExtractData, WriteObject) and iteration (ForEachTag, DoWhileNextLink).

action type

action

description, parameters

navigation

LoadPage

Loads a web page and stores it under a document name in the context.

Parameters:

§  URL of the web page,

§  Name of the document.

FollowLink

Loads a web page by following a link in a loaded document.

Parameters:

§  Document where the link is placed,

§  XPath expression expressing the subtree of the document,

§  regular expression – filtering only the part of the content which contains the link,

§  name of the document where the retrieved web page has to be stored.

extraction

ExtractData

Extracts content from document and puts it to an output object.

Parameters:

§  name of the source document,

§  XPath expression,

§  regular expression,

§  name of the output object,

§  path in the output object.

WriteObject

Stores output object in the target environments.

Parameters:

§  name of the output object to be stored.

iteration

ForEachTag

Executes the following actions in the wrapper program on the DOM subtrees defined by XPath.

Parameters:

§  name of the wrapped document where the XPath has to be applied,

§  the XPath expression selecting the tree part,

§  name of the document which the subtrees have to be saved to.

DoWhileNextLink

Cyclically executes the following actions filling a document with web pages gained from links on the page.

Parameters:

§  name of the document on which the action is applied,

§  the XPath expression defining the subtree containing the link,

§  regular expression for manipulation of the subtree content,

§  iteration limit.
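The combination of an XPath expression and a regular expression used by ExtractData can be illustrated with a minimal sketch based on the standard javax.xml.xpath and java.util.regex packages. The class ExtractDataSketch and its method are hypothetical and do not reproduce the WRAPPER implementation. The sketch expects well-formed XHTML, whereas the tool itself parses real-world HTML through NekoHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExtractDataSketch {
    // Selects nodes by XPath, then narrows the text of each node with a regex.
    static List<String> extract(String xhtml, String xpath, String regex) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    xpath, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
            Pattern pattern = Pattern.compile(regex);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Matcher m = pattern.matcher(nodes.item(i).getTextContent());
                if (m.find()) {
                    // Use the first capturing group if there is one, else the whole match.
                    values.add(m.groupCount() > 0 ? m.group(1) : m.group());
                }
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or document", e);
        }
    }
}
```

For example, the XPath expression //td[@class='price'] together with the regular expression (\d+) EUR would pick the numeric part out of every price cell; in the tool the extracted values would then be placed at the given path of the output object.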

Wrapper induction in WRAPPER v2

The second version of WRAPPER uses a local context instead of the global one implemented in the first version. This has the advantage that wrapper programs are more intuitive to build. The wrapper programs are trees of wrapping actions, which reflect the logical structure of the information content of the web page. The second version introduces wrapper induction instead of wrapper definition: the user teaches what needs to be wrapped instead of defining how it has to be wrapped. The intention was to lower the necessary level of knowledge and the need to delve into the nitty-gritty details of the web page structure. The developer should concentrate on giving examples of document patterns instead of defining XPath and regular expressions.

The set of wrapping actions also differs from those in the first version.

 

action type

action

description

navigation

LoadPage

Loads a web page and stores it under a document name in the context.

Parameter:

§  URL of the web page.

FollowLink

Loads a web page by following a link defined by a learnt pattern.

Parameters:

§  URI of the document intended for parsing,

§  iteration limit – maximum number of links to be followed; can be set to infinity,

§  learning strategy – learning approach applied in the learning process.

The patterns are learnt in an interactive manner through a JRex browser.

extraction

ExtractData

Extracts content from document and puts it to an output object.

Parameters:

§  learning strategy – learning approach applied in the learning process,

§  output object ID,

§  path in the output object – the position in the tree where the extracted data is stored.

The output object ID and the path are optional. If they are left out, the document changes only in the context and nothing is stored in the output objects. The patterns are learnt in an interactive manner through a JRex browser.

PushData

Pushes the output object to the target environments.

Parameters:

§  pattern type – the type to which the actual pattern is converted,

§  output object ID,

§  path in the output object – the position in the tree where the extracted data is stored.

The output object ID and the path are optional. If they are left out, the document changes only in the context and nothing is stored in the output objects.

Interactive learning of DOM patterns

The document patterns are taught using an interactive wrapper learner. It is an enhanced browser implemented with JRex, which interfaces with Mozilla, a reliable web browser. The process of interactive pattern learning follows a simple algorithm:

1.    the user chooses the first example from the web page

2.    repeat the following steps until the user is satisfied with the induced pattern

a.    the user chooses an example from the web page

b.    the pattern inducer generalizes from the examples; the user can see the induced pattern

The user has to define at least two examples. A screenshot of the browser with highlighted instances of the generalized patterns is shown in Figure 7.

Figure 7. Enhanced browser designed for pattern learning.

A.3          Development Manual

The architecture of WRAPPER v2 and a guide to its extension are discussed in this section. The logic and the architecture changed in most parts of the tool in the second version. The development manual does not cover WRAPPER v1; only the second version is considered for future extension.

Tool architecture

The architecture of the tool (depicted in Figure 8) consists of interaction modules (presentation module, communication module), core modules (application and application presentation modules) and an outputting module.

Figure 8. Rough system architecture.

Presentation and communication module of the interactive browser

The patterns on the web page are induced during the interactive process with the user. This process is supported by an enhanced browser built on JRex, a Java-Mozilla interface. The enhanced browser is capable of gathering positive or negative examples of document patterns. The communication module is a controller of the example gathering process.

Application and application presentation module

The application presentation module provides the GUI of the WRAPPER tool. The user interface supports the user in wrapper construction, editing, debugging and interpretation. Package sk.fiit.wrapper.gui contains the GUI related code.

The wrapper program is a tree structure of actions. The state of the wrapper program's execution is captured in the Context object. The Context object contains the current subdocument, the output objects, web page cookies and the authentication data required by the web page. A subdocument is a representation of a part of the document. The representation can take different forms. Currently the WRAPPER tool supports a DOM representation and a language code representation. The former is generally used during the extraction process. The latter makes sense only in PushData actions, for outputting the language code of the subdocument. Other possible and potentially useful subdocument types are a string representation and a visual representation of the document.

A Pattern contains a Filter object, which for a given subdocument produces another subdocument fulfilling the filter’s conditions. The Filter is the result of the learning process. The code related to Filter learning is placed in the package sk.fiit.wrapper.core.actions.pattern.learning. The filters implement different learning strategies. These Filter types are located in the package sk.fiit.wrapper.core.actions.pattern.filter.


Error handling

The error handling did not change in the second version. The errors occurring during the wrapping process are handled by error handling strategies. So far two strategies are implemented: StopThrowErrorHandler and ReturnBackErrorHandler.

With StopThrowErrorHandler, a WrapperException is thrown when an error occurs either in the filtering process or in the parser. It is handled by the Interpret, the interpreter object of the wrapper program. Figure 9 depicts the schema of wrapper exceptions.

Figure 9. Schema of wrapper exceptions.

Exceptions thrown during the wrapping process are packed in an ActionError; if the processing ends with an error, the error is further packed in an ActionException and sent to the ownErrorHandler method of the Interpret for handling. If the ReturnBackErrorHandler is used, processing of the wrapper program continues.
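The difference between the two strategies can be sketched as follows. Only the names StopThrowErrorHandler, ReturnBackErrorHandler and WrapperException come from the tool. The ErrorHandler interface, the method signatures, the InterpreterSketch class and the unchecked exception type are assumptions made for this illustration, and the ActionError and ActionException wrapping is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Unchecked here for brevity; the real WrapperException may be declared differently.
class WrapperException extends RuntimeException {
    WrapperException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical strategy interface.
interface ErrorHandler {
    void handle(String actionName, Exception cause);
}

// Stops the wrapper program by rethrowing the error.
class StopThrowErrorHandler implements ErrorHandler {
    public void handle(String actionName, Exception cause) {
        throw new WrapperException("Action failed: " + actionName, cause);
    }
}

// Records the failure and returns, so that processing continues.
class ReturnBackErrorHandler implements ErrorHandler {
    final List<String> skipped = new ArrayList<>();

    public void handle(String actionName, Exception cause) {
        skipped.add(actionName);
    }
}

class InterpreterSketch {
    // Runs the actions in order and returns the number that completed.
    static int run(List<Runnable> actions, ErrorHandler handler) {
        int completed = 0;
        for (int i = 0; i < actions.size(); i++) {
            try {
                actions.get(i).run();
                completed++;
            } catch (RuntimeException e) {
                handler.handle("action#" + i, e);
            }
        }
        return completed;
    }
}
```

With the stopping strategy the first failing action ends the run with a WrapperException; with the returning strategy the failing action is skipped and the remaining actions are still executed.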

Outputting module

Output objects have a tree-like structure and are implemented as DOM trees. OutputObject is an interface with two implementations, DOMOutputObject and RelativeOutputObject. When an action writes to the output object, it has to specify the path in the tree where the data is to be stored. When a Context is passed to the next Action in the wrapper interpretation process, the OutputObjects are wrapped, following the decorator design pattern, in a RelativeOutputObject, which makes access to the OutputObject interface values relative.
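The decorator idea can be sketched as follows. The names OutputObject and RelativeOutputObject are taken from the text, but the write method, the constructor and the path syntax are assumptions. MapOutputObject is a hypothetical stand-in that stores values in a map keyed by absolute path instead of a DOM tree.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shape of the interface: one write operation addressed by a path.
interface OutputObject {
    void write(String path, String value);
}

// Stand-in for DOMOutputObject: keeps values in a map keyed by absolute path.
class MapOutputObject implements OutputObject {
    final Map<String, String> values = new LinkedHashMap<>();

    public void write(String path, String value) {
        values.put(path, value);
    }
}

// Decorator: prefixes every path with a base path before delegating.
class RelativeOutputObject implements OutputObject {
    private final OutputObject delegate;
    private final String basePath;

    RelativeOutputObject(OutputObject delegate, String basePath) {
        this.delegate = delegate;
        this.basePath = basePath;
    }

    public void write(String path, String value) {
        delegate.write(basePath + "/" + path, value);
    }
}
```

An action that receives a decorated object can write to a short path such as title without knowing where in the overall output tree it is working; decorators can also be nested, so each level of the action tree adds its own prefix.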

A.4          Manual for Adaptation to Other Domains

The WRAPPER tool is intended to be usable on web pages of any domain. Web sites with a uniform structure, usually generated by scripts, are well suited to wrapping with the tool. Domains where web page wrapping is usually applied include:

§  job offers,

§  e-shops,

§  used car seller sites,

§  real estate sites,

§  travel agencies.

Wrappers are usually used for competitor monitoring, price comparison sites and searching. Because the WRAPPER tool is intended for the development of wrappers for web sites and not for domains, it is not feasible to gather data from sites for which the wrapper was not implemented.

The architecture of the tool is flexible and can easily be extended with other types of documents and other learning techniques for pattern learning. The functionality of WRAPPER can also be extended to applications that are not strictly web-based. We list two additional document types which might be useful extensions:

§  String documents – the documents are not taken as DOM documents, but as strings,

§  Networks – Network structures are present in the domain of Web, where networks are formed on social networking services. Wrapping specific subnetworks or vertices/actors in the network might be crucial for some applications.

A.4.1      Configuring to Other Domain

Applying WRAPPER to another domain means implementing a wrapper for a set of web pages of that domain. No specific configuration is necessary.

Extending the tool so that it can also wrap other document types involves development effort; refer to the Development Manual.

A.4.2      Dependencies

The libraries used by the WRAPPER tool are necessary for wrapping web pages. If the tool is extended to wrap sources other than the web, other libraries may be necessary; their choice depends on the expertise of the developer.

 

 



[1] The extraction rules do not necessarily mean rules as understood in machine learning. The learned document pattern can also be expressed in other ways, such as a neural network or an SVM.

[2] http://jaxen.org/

[3] http://www.cs.waikato.ac.nz/ml/weka/

[4] http://xml.apache.org/xalan-j/

[5] http://people.apache.org/~andyc/neko/doc/html/

[6] http://logging.apache.org/log4j/1.2/index.html

[7] http://jrex.mozdev.org/

[8] Sesame and Openrdf - http://www.openrdf.org/

[9] http://poi.apache.org/

[10] https://rio.dev.java.net/

[11] http://commons.apache.org/fileupload/

[12] http://commons.apache.org/lang/

[13] http://jakarta.apache.org/httpcomponents/httpclient-3.x/

[14] http://commons.apache.org/validator/

[15] http://commons.apache.org/logging/

[16] http://commons.apache.org/codec/