Method for Automated On-line Annotation (tool Pannda)
Extracts ontology individuals from text according to a domain ontology.
Institution: | Slovak University of Technology |
Technologies used: | Java, Sesame, SimMetrics, Apache Lucene, QTag, MySQL |
Inputs: | HTML or text document, domain ontologies |
Outputs: | Ontology individuals extracted from text with coordinates of each individual within text |
Documentation: | HTML, doc, JavaDoc |
Addressed Problems
Recognizing the meaning of data shown on the web is mostly trivial for a person, but often very difficult, if not impossible, for a machine. We therefore aim to add semantics to the present web, so that data on the web can be accessed and understood not only by a human, but also by a machine. Moreover, the Pannda tool uses the recognized semantic data during web browsing to mark relevant parts of the text that could be interesting for the reader.
Description
The main task is to design a method that simplifies navigation on a web page by enhancing the page content with useful annotations. The annotation process is based on known ontologies and user preferences. Annotations are added to the document on-line, whenever a page is accessed by a web browser.
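To make the idea of adding annotations to a document more concrete, here is a minimal sketch in Python. It assumes that extraction has already produced character coordinates for each recognized individual. The function name `annotate`, the `<span class="annotation">` markup and the `title` attribute are illustrative assumptions, not the markup that Pannda actually emits.

```python
from html import escape


def annotate(text, spans):
    """Wrap each (start, end, individual) span of plain text in a <span> tag.

    The markup used here is a hypothetical choice for illustration only.
    """
    out = []
    pos = 0
    for start, end, individual in sorted(spans):
        if start < pos:
            continue  # this simple sketch skips overlapping spans
        out.append(escape(text[pos:start]))
        out.append(
            '<span class="annotation" title="%s">%s</span>'
            % (escape(individual, quote=True), escape(text[start:end]))
        )
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(annotate("Visit Bratislava today", [(6, 16, "city:Bratislava")]))
```

Sorting the spans and copying the untouched text between them keeps the coordinates valid, because every offset refers to the original text rather than to the partially annotated output.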
Annotation is performed in five steps that use four different algorithms. The steps fall into two groups. The aim of the first group is to find parts of the processed text that represent (or match) concrete instances from the given domain ontologies. Instances are recognized according to:
- their labels - the text and the labels are normalized first (splitting into words, lowercasing, reducing words to their root form and removing stop words). If all keywords of an instance label follow each other in the processed text (in any order), that part of the text is marked as a found instance.
- regular expressions - the regular expressions used to find an instance must be specified, so the ontology has to be adapted
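The label-based step above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list and the crude suffix stripper `stem` are hypothetical stand-ins for whatever normalization the tool really applies, and `find_instance` is an invented name. The sketch reports the character coordinates of every window of consecutive normalized words that contains exactly the keywords of the label, in any order.

```python
import re

# Hypothetical stop-word list; the real tool uses its own resources.
STOP_WORDS = {"the", "of", "a", "an", "and", "in"}


def stem(word):
    """Crude stand-in for reduction to root form (assumption, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Return (root form, start, end) for every word that is not a stop word."""
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        word = match.group().lower()
        if word not in STOP_WORDS:
            tokens.append((stem(word), match.start(), match.end()))
    return tokens


def find_instance(text, label):
    """Find spans where all label keywords follow each other in any order."""
    keywords = sorted(token[0] for token in tokenize(label))
    tokens = tokenize(text)
    size = len(keywords)
    matches = []
    if size == 0:
        return matches
    for i in range(len(tokens) - size + 1):
        window = tokens[i : i + size]
        if sorted(token[0] for token in window) == keywords:
            # Coordinates run from the first to the last word of the window.
            matches.append((window[0][1], window[-1][2]))
    return matches


if __name__ == "__main__":
    text = "Courses at the University of Technology Slovak campus"
    for start, end in find_instance(text, "Slovak University of Technology"):
        print(start, end, text[start:end])
```

Because stop words are dropped before the comparison, the word "of" inside the matched span does not prevent the match, while the returned coordinates still refer to the original, unnormalized text.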
The second group of annotation steps aims to find parts of the text that could potentially be instances of known concepts from the given ontologies. Concepts are recognized according to:
- their labels - we assume an HTML document as input and rely on the naming of HTML classes (labels and class names are normalized first)
- regular expressions - similar to the second step of the previous group
- language patterns - these also use regular expressions, but the expressions are language specific rather than concept or instance specific, so the existing ontology does not need to be adapted
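The label-based concept step in the list above can be illustrated with a short sketch built on Python's standard `html.parser` module. The normalization shown here (splitting camelCase, hyphens and underscores, then lowercasing) and the names `normalize`, `ConceptCandidateParser` and `find_concept_candidates` are assumptions made for the example. The text of an element is reported as a candidate instance when one of its class names, or a class name of an enclosing element, normalizes to the same words as a concept label.

```python
import re
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the element stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def normalize(name):
    """Split camelCase, hyphens and underscores into a tuple of lowercase words."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    return tuple(w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w)


class ConceptCandidateParser(HTMLParser):
    def __init__(self, concept_labels):
        super().__init__()
        # Map each normalized label to the original concept label.
        self.concepts = {normalize(label): label for label in concept_labels}
        self.stack = []       # concept (or None) in effect for each open element
        self.candidates = []  # collected (concept label, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        concept = None
        for cls in (dict(attrs).get("class") or "").split():
            concept = self.concepts.get(normalize(cls))
            if concept:
                break
        # Nested elements inherit the concept of their parent.
        inherited = self.stack[-1] if self.stack else None
        self.stack.append(concept or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            self.candidates.append((self.stack[-1], text))


def find_concept_candidates(html, concept_labels):
    parser = ConceptCandidateParser(concept_labels)
    parser.feed(html)
    parser.close()
    return parser.candidates


if __name__ == "__main__":
    page = ('<div><span class="jobTitle">Java developer</span> at '
            '<span class="company-name">Acme</span></div>')
    print(find_concept_candidates(page, ["Job Title", "Company Name"]))
```

This sketch assumes reasonably well-formed HTML. A production implementation would also need to handle unclosed tags and to record the coordinates of each candidate within the document.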
References
- Martin Adam (2007). An Approach to Automated On-line Annotation. In Proc. of research project workshop, Tools for Acquisition, Organisation and Presenting of Information and Knowledge, P. Návrat et al. (Eds.), Polana, Slovakia.