Omin - Valuating elements of web pages

A tool that ranks the markup elements of web pages by their visual importance.

Institution: Safarik University
Technologies used: Java, CSS Parser, diff
Inputs: XHTML documents
Outputs: Pairs of attribute names and attribute values

Addressed Problems

Many web pages that offer commercial products or services are generated by template engines, so their structure and formatting vary little from page to page. By comparing the source and structure of multiple pages, we can discover the likely attribute values of the presented objects, since these tend to occur at the places where the pages differ.

Description

The Omin method takes two or more job offer web pages from a single portal and finds their structural or textual differences using a modified diff algorithm. Because a page template has only a limited number of variables, and these most often hold the actual attribute values, we can reconstruct the values of these variables by comparing multiple pairs of pages and locating the spots where they differ (the difference spots).
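The comparison step can be illustrated with a minimal sketch. It assumes a plain longest-common-subsequence diff over a crude tag/text tokenization, which stands in for Omin's modified diff algorithm (not specified here). The class name DiffSpots and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds tokens of a base page that vary across sibling pages.
final class DiffSpots {

    // Splits markup into tags and trimmed text runs.
    // Assumption: a simplified tokenizer, not a real XHTML parser.
    static List<String> tokenize(String xhtml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(xhtml);
        while (m.find()) {
            String t = m.group().trim();
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Marks every token of base that is not part of a longest common
    // subsequence shared with other (standard LCS diff, not Omin's variant).
    static boolean[] changed(List<String> base, List<String> other) {
        int n = base.size(), m = other.size();
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                lcs[i][j] = base.get(i).equals(other.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
        boolean[] diff = new boolean[n];
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (base.get(i).equals(other.get(j))) { i++; j++; }
            else if (lcs[i + 1][j] >= lcs[i][j + 1]) { diff[i++] = true; }
            else { j++; }
        }
        while (i < n) diff[i++] = true; // unmatched tail of the base page
        return diff;
    }

    // A token is a difference spot if it changed in at least one pairwise comparison.
    static List<String> differenceSpots(String basePage, List<String> otherPages) {
        List<String> base = tokenize(basePage);
        boolean[] variable = new boolean[base.size()];
        for (String page : otherPages) {
            boolean[] d = changed(base, tokenize(page));
            for (int k = 0; k < d.length; k++) variable[k] |= d[k];
        }
        List<String> spots = new ArrayList<>();
        for (int k = 0; k < variable.length; k++)
            if (variable[k]) spots.add(base.get(k));
        return spots;
    }
}
```

Comparing the base page against several other pages, rather than a single one, reduces the chance that a template variable is missed because two pages happen to share the same value.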

The discovered difference spots then have to be mapped to the actual attribute values of objects. An extraction ontology based on regular expressions refines this mapping.
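A regular-expression-based ontology of this kind might look like the sketch below. The attribute names (salary, date, email) and their patterns are illustrative assumptions for job offers, not the rules Omin actually uses. The class ExtractionOntology is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical ontology: an ordered list of (attribute, pattern) rules.
final class ExtractionOntology {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    ExtractionOntology rule(String attribute, String regex) {
        rules.put(attribute, Pattern.compile(regex));
        return this;
    }

    // Returns the first attribute whose pattern matches the whole difference spot.
    Optional<String> classify(String spot) {
        String value = spot.trim();
        for (Map.Entry<String, Pattern> e : rules.entrySet())
            if (e.getValue().matcher(value).matches()) return Optional.of(e.getKey());
        return Optional.empty();
    }

    // Assumed example rules; real rules would be tuned to the portal.
    static ExtractionOntology jobOffers() {
        return new ExtractionOntology()
                .rule("salary", "\\d+(?:[ .,]\\d{3})*\\s*(?:EUR|USD)")
                .rule("date", "\\d{1,2}\\.\\s?\\d{1,2}\\.\\s?\\d{4}")
                .rule("email", "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");
    }
}
```

A spot that matches no rule is left unclassified, so it can be passed on to other heuristics instead of being forced into a wrong attribute.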

We provide various methods and heuristics that obtain the relevant attribute values from these difference spots and discover possible attribute names in the vicinity of the attribute values. (One of the name discovery methods is based on visual emphasis achieved via CSS rules.)
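The emphasis-based name discovery can be sketched as follows, assuming the CSS rules have already been resolved into a font weight and font size for each text node (a CSS parser would supply these; that step is omitted). The emphasis score, the search window, and the names NameDiscovery and StyledText are hypothetical choices for illustration, not Omin's actual heuristic.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical heuristic: the nearest preceding, more emphasized text is the attribute name.
final class NameDiscovery {

    // A text node with already-resolved CSS properties (assumed input).
    static final class StyledText {
        final String text;
        final int fontWeight;     // e.g. 400 = normal, 700 = bold
        final double fontSizePx;

        StyledText(String text, int fontWeight, double fontSizePx) {
            this.text = text;
            this.fontWeight = fontWeight;
            this.fontSizePx = fontSizePx;
        }

        // Assumed scoring: font size plus a fixed bonus for bold text.
        double emphasis() {
            return fontSizePx + (fontWeight >= 600 ? 4 : 0);
        }
    }

    // Looks back at most window nodes from the value for a more emphasized node.
    static Optional<String> nameFor(List<StyledText> nodes, int valueIndex, int window) {
        double valueEmphasis = nodes.get(valueIndex).emphasis();
        for (int i = valueIndex - 1; i >= Math.max(0, valueIndex - window); i--) {
            StyledText candidate = nodes.get(i);
            if (candidate.emphasis() > valueEmphasis) {
                // Strip a trailing colon so "Salary:" becomes "Salary".
                return Optional.of(candidate.text.replaceAll(":\\s*$", "").trim());
            }
        }
        return Optional.empty();
    }
}
```

The window limits the search to the vicinity of the value, so a distant page heading is not mistaken for the name of an attribute.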

OMIN Schema

Schema of OMIN method components