A              Pannda – Automated On-line Annotation

A.1          Basic Information

Recognizing the meaning of data shown on the web is mostly trivial for a person, but often very difficult, if not impossible, for a machine. We therefore aim to add semantics to the present web, so that web data can be accessed and understood not only by humans but also by machines. Moreover, the Pannda tool uses recognized semantic data during web browsing to mark relevant parts of text that could be interesting for the reader.

The Pannda tool aims to simplify navigation on a web page by enhancing the page content with useful annotations. The annotation process is based on known ontologies and user preferences. Annotations are added to the document on-line, whenever a page is accessed by a web browser.

A.1.1      Basic Terms

On-line annotation

An annotation process performed while the document is being accessed. It is the opposite of offline annotation, in which documents are prepared (annotated) in advance.

Concept pattern

A regular expression used to find a concrete concept within the input text. Every concept in the ontology can have as many patterns as necessary.

Instance pattern

A regular expression used to find a concrete instance within the input text. Every instance in the ontology can have as many patterns as necessary.

Language pattern

A regular expression used to find any instance (even an unknown one) of any known concept from the ontology. It can contain pseudo elements (i.e. <concept>, <instance>), which are automatically translated into regular expressions. These patterns are typically language specific and are not bound to any concrete concept or instance.
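The translation of pseudo elements into regular expressions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not Pannda's actual implementation: the class name, the expansion strategy (an alternation of known concept patterns for <concept>, a guess based on capitalized words for <instance>) and the sample English pattern are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; names and expansion rules are illustrative assumptions. */
final class LanguagePatternSketch {

    /** Assumed stand-in for an unknown instance: one or more capitalized words. */
    static final String INSTANCE_REGEX = "(?<instance>[A-Z][\\w-]*(?: [A-Z][\\w-]*)*)";

    /** Replaces the <concept> and <instance> pseudo elements with named regex groups. */
    static Pattern expand(String languagePattern, List<String> conceptPatterns) {
        // Alternation of all known concept patterns, captured in a named group.
        String conceptRegex = "(?<concept>" + String.join("|", conceptPatterns) + ")";
        String regex = languagePattern
                .replace("<concept>", conceptRegex)
                .replace("<instance>", INSTANCE_REGEX);
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        // Made-up, English-specific language pattern and concept patterns.
        Pattern p = expand("<instance> is an? <concept>",
                Arrays.asList("compan(?:y|ies)", "cit(?:y|ies)"));
        Matcher m = p.matcher("We learned that Acme Labs is a company based in Bratislava.");
        while (m.find()) {
            System.out.println(m.group("instance") + " -> " + m.group("concept"));
        }
    }
}
```

Because the sketch uses named groups, each pseudo element may appear only once per language pattern; a pattern that repeats a pseudo element would need numbered or uniquely named groups instead.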

A.1.2      Method Description

Annotation is done in five steps using four different algorithms. These steps can be split into two groups. The first group aims to find parts of the processed text that represent (or match) concrete instances from the given ontologies describing the domain. We recognize instances according to

The second group of annotation steps aims to find parts of the text that could potentially be instances of known concepts from the given ontologies. We recognize concepts according to

A.1.3      Scenarios of Use

Pannda can be used to enhance texts or web page content by adding useful annotations that help users orient themselves within a domain or understand the text. Annotations are inserted into the document on-line, meaning that a web page can be annotated while it is being transferred to the user. Possible usage scenarios are as follows:

Pannda should not be used in the following cases:

§  No domain ontology is available for the domain of the browsed pages.

§  The ontology contains no (or too few) instances and annotation rules.

A.1.4      External Links and Publications

§  Martin Adam (2007). An Approach to Automated On-line Annotation. In Proc. of research project workshop, Tools for Acquisition, Organization and Presenting of Information and Knowledge, P. Návrat et al. (Eds.), Poľana, Slovakia.

§  Tomcat, a Java servlet container, Apache Software Foundation. (tomcat.apache.org)

A.2          Integration Manual

Pannda is developed in Java (Standard Edition 5) as a library. It is meant to be used in conjunction with other applications that require on-line annotation. The distribution of Pannda consists of these parts:

A.2.1      Dependencies

Pannda uses these external tools and libraries:

A.2.2      Installation

Before deploying Pannda into an application the following prerequisites must be met:

Deploying Pannda involves these steps:

A.2.3      Configuration

Pannda must be configured to use suitable ontology storage, lexical data (pannda.properties) and language-dependent annotation patterns (regexp.patterns).

The pannda.properties file contains, among others, the following values:

§  PANNDA_NAMESPACE – namespace of Pannda elements in used ontology files.

§  ONTOLOGY_x – defines the ID (x), caption and file name of an ontology. There can be as many of these rows in the configuration as needed (use a number instead of “x” for every new entry).

§  COSINE_SIMILARITY_TRESHOLD – threshold for the cosine similarity string comparer, used to decide whether two strings are similar (i.e. treated as identical when compared).

§  PART_OF_SPEACH_GROUP_x – defines an abbreviation for several part-of-speech types at once. There can be as many of these rows in the configuration as needed (use a number instead of “x” for every new entry).
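As a rough illustration of how such values could be consumed, the following sketch reads numbered entries from a properties source and applies a similarity threshold. Only the property names are taken from the list above. The class and method names, the sample values, the assumption that the threshold is a decimal number, and the choice of character bigrams as the basis for cosine similarity are all hypothetical; Pannda may parse the values and compare strings differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical sketch; only the property names come from the configuration list. */
final class PanndaConfigSketch {

    /** Collects all numbered entries with the given prefix, e.g. "ONTOLOGY_". */
    static Map<Integer, String> numberedEntries(Properties props, String prefix) {
        Map<Integer, String> entries = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                String suffix = key.substring(prefix.length());
                if (suffix.matches("\\d+")) {
                    // Values are kept as opaque strings; their syntax is not assumed here.
                    entries.put(Integer.valueOf(suffix), props.getProperty(key));
                }
            }
        }
        return entries;
    }

    /** Counts character bigrams of a lower-cased string (assumed tokenization). */
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        String t = s.toLowerCase();
        for (int i = 0; i + 2 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of the two bigram count vectors, in the range [0, 1]. */
    static double cosineSimilarity(String a, String b) {
        Map<String, Integer> va = bigrams(a);
        Map<String, Integer> vb = bigrams(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = vb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : vb.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample values; the real value syntax is defined by Pannda.
        String sample = "ONTOLOGY_1=first ontology entry\n"
                + "ONTOLOGY_2=second ontology entry\n"
                + "COSINE_SIMILARITY_TRESHOLD=0.8\n";
        Properties props = new Properties();
        props.load(new StringReader(sample));

        System.out.println(numberedEntries(props, "ONTOLOGY_"));
        double threshold = Double.parseDouble(props.getProperty("COSINE_SIMILARITY_TRESHOLD"));
        double similarity = cosineSimilarity("Bratislava", "Bratislave");
        System.out.println(similarity >= threshold ? "similar" : "different");
    }
}
```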

A.2.4      Integration Guide

Pannda annotates text documents (including web pages) on-line, while they are being processed on either the server or the client side. The annotation functionality can be used through a single method (Core.annotate). The library also contains a Cocoon transformer API (class CocoonTransformer), which is recommended for Cocoon applications. In that case, an additional parameter (configPath) holding the path to the configuration file has to be defined in the sitemap file. The annotations are created without direct user interaction and are shown directly inside the web page (as shown in Figure 1).

For a detailed description of the individual methods, see the accompanying Javadoc documentation.

Figure 1. Sample annotation.

Error handling

Errors are mostly caused by bad configuration and/or inappropriate input (e.g., input that is neither a text file nor a web page). If an error is too serious or unknown and thus cannot be reasonably handled, an exception is thrown and a log entry is made.

A.3          Development Manual

A.3.1      Tool Structure

Pannda consists of the following packages:

A.3.2      Method Implementation

The implementation of Pannda is relatively straightforward. Its operation consists of three primary tasks. First, the whole input document is annotated. Second, the annotations are enhanced with additional data and actions where possible. Last, ambiguities are eliminated by merging touching annotations into one.

The annotation process is done in five steps as follows:

In the last step, annotations related to the same part of the annotated text are merged into one annotation. The start and end positions of the merged annotation are chosen so that all original annotations lie within it.
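The merging step can be sketched as a classic interval merge. The Span class, its half-open character offsets, and the way labels are combined are assumptions made for this example; the actual annotation data structure in Pannda is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of merging annotations by text position. */
final class AnnotationMergeSketch {

    /** Assumed minimal annotation: half-open character range [start, end) plus a label. */
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ") " + label;
        }
    }

    /** Merges overlapping or touching spans; each result covers all of its originals. */
    static List<Span> merge(List<Span> input) {
        List<Span> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));

        List<Span> merged = new ArrayList<>();
        for (Span next : sorted) {
            Span last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && next.start <= last.end) {
                // Same part of the text: widen the last span and keep both labels.
                merged.set(merged.size() - 1, new Span(
                        last.start, Math.max(last.end, next.end), last.label + "; " + next.label));
            } else {
                merged.add(next);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span(3, 9, "instance"),
                new Span(0, 5, "concept"),
                new Span(12, 15, "instance"));
        for (Span s : merge(spans)) {
            System.out.println(s);
        }
    }
}
```

Sorting by start position first means a single pass is enough: a span can only overlap or touch the most recently merged one.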

A.3.3      Enhancements and Optimization

Originally, Pannda was implemented as a stand-alone web service running in Apache Tomcat and communicating directly with a web browser via a plug-in using the Annotea[1] protocol. Unfortunately, this was not an optimal solution: plug-ins had to be implemented for every browser, and the Pannda service was not always able to obtain the same web page content as the browser, which resulted in incorrect annotations.

Pannda has been optimized for fast response times only to a limited extent, by caching frequently used data. Experience with the first web service implementation suggests that one of the slowest parts was HTML processing, which is not needed at all in the current implementation because the document arrives already parsed as SAX events.

Further optimization is possible, for example by returning to the service architecture with a more suitable protocol and implementing a service consumer on the server side of the NAZOU project.
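To illustrate what working directly on SAX events can look like, the following sketch wraps every match of a regular expression in a span element while the events stream through a filter. It uses only the standard JAXP and SAX classes and is not the CocoonTransformer class of Pannda; the class name, the span markup, the class attribute value and the simplified serializer (which does not escape special characters) are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter that annotates text nodes; not Pannda's transformer. */
final class SaxAnnotationSketch extends XMLFilterImpl {
    private final Pattern pattern;

    SaxAnnotationSketch(XMLReader parent, Pattern pattern) {
        super(parent);
        this.pattern = pattern;
    }

    /** Splits each text event around the matches and wraps every match in a span. */
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String text = new String(ch, start, length);
        Matcher m = pattern.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore empty matches
            }
            emitText(text.substring(pos, m.start()));
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "class", "class", "CDATA", "annotation");
            super.startElement("", "span", "span", attrs);
            emitText(m.group());
            super.endElement("", "span", "span");
            pos = m.end();
        }
        emitText(text.substring(pos));
    }

    private void emitText(String s) throws SAXException {
        if (!s.isEmpty()) {
            super.characters(s.toCharArray(), 0, s.length());
        }
    }

    /** Parses the XML, runs it through the filter and returns simplified markup. */
    static String annotate(String xml, Pattern pattern) {
        final StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            SaxAnnotationSketch filter = new SaxAnnotationSketch(reader, pattern);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    out.append('<').append(qName);
                    for (int i = 0; i < a.getLength(); i++) {
                        out.append(' ').append(a.getQName(i))
                           .append("=\"").append(a.getValue(i)).append('"');
                    }
                    out.append('>');
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    out.append("</").append(qName).append('>');
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Annotation failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(annotate("<p>Visit Bratislava today</p>", Pattern.compile("Bratislava")));
    }
}
```

A SAX parser is free to deliver one text node in several characters events, so a production filter would have to buffer text before matching; the sketch omits this for brevity.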

A.4          Manual for use in Other Application Domains

The method used in Pannda to search for annotations and display them to the user is domain independent. However, for the tool to work properly in another application domain, the ontology used must be extended to contain search patterns (regular expressions). Results will be better if the ontology also has enough instances defined. When changing the application domain, adjusting the language patterns (stored in a separate file) should be considered, as some of them may be partially domain specific.

All mentioned steps can be done without any code or configuration changes. Thus no application changes are required to use Pannda in another application domain.

A.4.1      Configuration for use in Other Application Domains

No domain-specific configuration is necessary other than that described in the configuration section. Only a revision of the language patterns is advised.

A.4.2      Dependencies

Pannda has no domain-specific dependencies.

[1] http://www.w3.org/2001/Annotea/