A              RIDAR – Method for Relevant Internet Data Resource Identification

A.1          Basic Information

Information acquisition systems often need to identify primary Internet resources. RIDAR exploits existing search engines to retrieve links to relevant Internet resources based on user-supplied search terms or more complex search expressions. Details about the identified resources (URL, title, etc.) are stored in databases or sent to other tools for further processing.

Within the NAZOU project, RIDAR is used primarily to identify sources of job offers on the Internet. The method exploits existing search services to acquire links to potential sources of job offers. It thus utilizes the extensive space of keyword-indexed data sources covered by existing search engines such as Google[1], AllTheWeb[2] or Yahoo![3].

A.1.1      Basic Terms

API – Application Programming Interface

SOAP – Simple Object Access Protocol

WSDL – Web Service Description Language

A.1.2      Method Description

This method exploits existing search engines to identify relevant information resources on the Internet based on user-supplied search terms or more complex search expressions. RIDAR can integrate any search engine that exposes a web service API. Currently, the Google API[4] and Yahoo! API[5] search engines are integrated and supported.

RIDAR provides generic interfaces for integrating search engines as well as targets for storing search results, so retrieved results can be stored in any target, such as a database or a generic file. Currently, MySQL and NazouDB targets are implemented in RIDAR.

The method searches existing search engines using keywords or collocations that describe a particular domain. The tool integrates the web service APIs of the search engines. These APIs are implemented as web services based on SOAP and WSDL.

A.1.3      Scenarios of Use

RIDAR is suitable in two scenarios:

Before using the tool in either of the two scenarios mentioned above, one must gain access to the search engine APIs. Integration of a search engine follows these steps:

A license key (or application ID) must be supplied each time the API is accessed. Upon registration, one receives an authorization key that grants access to the search engine API. With basic registration, the number of queries is limited for each license key; an extended or unlimited number of queries is provided on a paid basis. RIDAR can manage several authorization keys from different search engines.

For the purposes of the NAZOU project, we registered with the Google and Yahoo! search engines and obtained a license key for Google and an application ID for Yahoo!.

A.1.4      External Links and Publications

BALOGH Z.: RIDAR – RELEVANT INTERNET DATA RESOURCE IDENTIFICATION. In: Laclavik M. et al.: WIKT 2006 Proceedings, 1st Workshop on Intelligent and Knowledge-oriented Technologies, ISBN 978-80-969202-5-9, pp.122, 2007, Bratislava, Slovakia.

HLUCHÝ, Ladislav - ŠELENG, Martin - ORAVEC, V. - BUDINSKÁ, Ivana - LACLAVÍK, Michal - GATIAL, Emil - BALOGH, Zoltán - CIGLAN, Marek. Data transition chain. In HLUCHÝ, Ladislav. Tools for acquisition, organisation and presenting of information and knowledge : proceedings in informatics and information technologies. - Košice : Vydavateľstvo STU, Bratislava, 2007. ISBN 978-80-227-2716-7, part 2, P. 79-91.

GATIAL, Emil - BALOGH, Zoltán - HLUCHÝ, Ladislav - VOJTEK, Peter. Identification and acquisition of domain dependent internet resources. In HLUCHÝ, Ladislav. Tools for acquisition, organisation and presenting of information and knowledge : proceedings in informatics and information technologies. - Košice : Vydavateľstvo STU, Bratislava, 2007. ISBN 978-80-227-2716-7, part 2, P. 68-78.

GATIAL, Emil - BALOGH, Zoltán. Identifying, retrieving and determining relevance of heterogenous internet resources. In Research Project Workshop. Tools for acquisition, organization and presenting of information and knowledge : proceedings in informatics and information technology. - Bratislava : Slovak University of Technology Bratislava, 2006. ISBN 80-227-2468-8, s. 15-21.

A.2          Integration Manual

A.2.1      Dependencies

The following software is required in order to install RIDAR:

A.2.2      Installation

Herein we describe how to install the RIDAR tool:

  1. Set the ${NAZOU_HOME} and ${NAZOU_SVN} environment variables. ${NAZOU_HOME} is any directory into which you wish to install NAZOU project components. ${NAZOU_SVN} is the directory into which the RIDAR source tree will be checked out from the version control system. Create the corresponding directories (if they do not exist):

$ export NAZOU_HOME=/usr/local/nazou

$ export NAZOU_SVN=/tmp

$ mkdir -p ${NAZOU_HOME} ${NAZOU_SVN}

  2. Check out the source code from SVN:

$ mkdir ${NAZOU_SVN}/ridar

$ svn co https://nazou.fiit.stuba.sk/svn/nazou/ridar ${NAZOU_SVN}/ridar

  3. Create the MySQL database for RIDAR:

$ cd ${NAZOU_SVN}/ridar/scripts

$ mysql -u ridar -p < ./ridar.sql

where -u specifies the database user and -p makes the mysql client prompt for that user's password. The user and password must be set up before importing the database structure. The SQL statements that create the RIDAR tables are in the ridar.sql file.

  4. Deploy the RIDAR tool:

$ ${ANT_HOME}/bin/ant -DDEPLOY_DIR=${NAZOU_HOME}/ridar dist

where ${ANT_HOME} is the home directory of the Ant software suite.

At this stage you are ready to use the RIDAR tool.

A.2.3      Configuration

Configuration requires only setting the database name (dbname), user (dbuser) and password (dbpass) in the ${NAZOU_HOME}/ridar/config.properties file.

Default values are set in the config.properties file in the ${NAZOU_HOME}/ridar/src/nazou/ridar/ directory:

dbname=nazou-ridar

dbuser=nazou

dbpass=*****
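The three keys above can be read with the standard java.util.Properties class. The following is a minimal sketch, not RIDAR's actual loading code: the class name RidarConfigSketch and the load() helper are hypothetical, and a string stands in for the config.properties file so that the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RidarConfigSketch {
    /** Reads dbname, dbuser and dbpass from a properties source; fails fast when a key is missing. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : new String[] {"dbname", "dbuser", "dbpass"}) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("Missing configuration key: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // In practice the Reader would wrap config.properties; the password here is a made-up value.
        String text = "dbname=nazou-ridar\ndbuser=nazou\ndbpass=secret\n";
        Properties props = load(new StringReader(text));
        System.out.println(props.getProperty("dbname") + " as " + props.getProperty("dbuser"));
    }
}
```

Validating the keys at load time means a misconfigured installation fails before any search engine is queried.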

Configuration also involves defining search plans. A plan is a database record that specifies which search engine should be queried with which search expression, and how many result pages RIDAR should retrieve and store. The syntax of search expressions may differ between search engines, so the user must make sure that each inserted search expression is valid for the specified search engine. The structure of the search plan table is the following:

Field            Type

id_schedule      bigint(11)
engine           varchar(32)
query_text       varchar(255)
pages            bigint(20)
processed        bigint(20)

An example entry is the following:

id_schedule  engine  query_text                             pages  processed

1            yahoo   job                                    6      0
2            google  career                                 3      0
4            google  site:www.profesia.sk list_offers.php3  2      0

The processed field is incremented upon every search engine query and retrieval of a new query result page.
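One row of the search plan table can be pictured as a small value object. The sketch below is illustrative only: the text confirms that a Plan type exists, but the field names, the remainingPages() helper and the rule that a plan is finished once processed reaches pages are assumptions.

```java
public class PlanSketch {
    /** Hypothetical value object mirroring one row of the search plan table. */
    static final class Plan {
        final long idSchedule;
        final String engine;
        final String queryText;
        final long pages;
        long processed;

        Plan(long idSchedule, String engine, String queryText, long pages, long processed) {
            this.idSchedule = idSchedule;
            this.engine = engine;
            this.queryText = queryText;
            this.pages = pages;
            this.processed = processed;
        }

        /** Assumption: a plan is finished once processed has reached pages. */
        long remainingPages() {
            return Math.max(0, pages - processed);
        }

        /** Mirrors the increment described above: one step per retrieved result page. */
        void pageRetrieved() {
            processed++;
        }
    }

    public static void main(String[] args) {
        Plan plan = new Plan(1, "yahoo", "job", 6, 0);
        plan.pageRetrieved();
        plan.pageRetrieved();
        System.out.println(plan.remainingPages() + " pages left for '" + plan.queryText + "'");
    }
}
```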

A.2.4      Integration Guide

This section describes how to use the tool. Usage consists of two steps: first, defining the search terms and expressions; second, executing the RIDAR tool.

  1. Search term and expression definition

Search terms are stored in a database to allow planning and scheduling of individual search executions.

  2. Execution of the RIDAR tool

RIDAR is invoked by executing the appropriate Java class. Because several required JAR files must be on the classpath for RIDAR to run, the SVN repository contains a special script for invoking RIDAR:

$ ${NAZOU_HOME}/ridar/scripts/run_ridar.sh

The content of this script is the following:

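#!/bin/bash

# Build the classpath: the current directory plus every JAR in the lib directory.
JAR=.
for jar in ${NAZOU_HOME}/ridar/lib/*.jar
do
 JAR=${JAR}:${jar}
done

# Run from the RIDAR installation directory.
cd ${NAZOU_HOME}/ridar

# Start the main RIDAR class.
${JAVA_HOME}/bin/java \
 -classpath ${JAR} \
 nazou.ridar.core.Ridar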

This script adds all JAR files from the ${NAZOU_HOME}/ridar/lib directory to the classpath and executes the main nazou.ridar.core.Ridar class.

For some systems it makes sense to execute RIDAR on a regular basis. This can be achieved by placing a record into the /etc/crontab file, which results in periodic execution by the cron daemon:

00 * * * * nazou nice -n 19 ${NAZOU_HOME}/ridar/scripts/run_ridar.sh

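This record in the crontab file causes RIDAR to be executed at the start of every hour under the nazou user. Note that cron does not inherit the environment of an interactive shell, so NAZOU_HOME and JAVA_HOME must be defined in the crontab file itself, or the paths must be written out in full.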

A.3          Development Manual

A.3.1      Tool Structure

All packages are organized under the top-level nazou.ridar package (referred to as n.r in the diagrams). The tool is organized into the following packages:

Figure 1: Organization of packages.

The main class is the n.r.core.Ridar class. The main() method of this class is invoked upon RIDAR tool execution. The class has the following dependencies:

Figure 2: The Ridar class and its dependencies

The next important class is the n.r.core.LinkRecordStorage class. When instantiating this class, we must specify both a search engine implementation and a database storage implementation:

Figure 3: LinkRecordStorage class and its dependencies.

This class queries the specified search engine and stores the results through the specified database storage. The structure of the table in which the retrieved links are stored is the following:

Field            Type

id               bigint(20)
title            varchar(255)
url              text
insert_date      timestamp
query_text       varchar(255)
engine           text
webcrawler       tinyint(1)

As for the database interface there are two basic implementations of the n.r.db.DatabaseInterface interface:

Figure 4: MySQL and NazouCM are the two implementations of the DatabaseInterface.

 

As for the search engine interface there are also two basic implementations of the n.r.searchengines.SearchEngineInterface interface:

Figure 5: Google and Yahoo are the two implementations of interface SearchEngineInterface.

The package n.r.cocoon contains a class which was used to demonstrate the RIDAR tool during the first evaluation of the project. The package n.r.results holds two classes which are used to represent search query results and to hand them over between individual classes.

A.3.2      Method Implementation

Upon execution, the tool creates a new Ridar class instance and invokes its run() method. Inside the run() method, a search plan is retrieved from a MySQL database. If the search plan is retrieved successfully, it is executed by calling the void execSearchPlan(LinkedList<Plan> searchPlan) method. This method traverses each plan record in the search plan list and, based on the search engine specified in the record, executes the appropriate search engine query. Each search engine is queried via the int querySearchEngine(String query, long from, long pages) method of the LinkRecordStorage class. The LinkRecordStorage instance is initialized with the selected SearchEngineInterface and DatabaseInterface implementations.
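The control flow described above can be sketched as follows. This is a simplified illustration under several assumptions, not the actual RIDAR source. The SearchEngine and LinkStore interfaces are stand-ins for SearchEngineInterface and DatabaseInterface. The Plan fields and the demo() helper are hypothetical, as is the reading of from and pos as result page indices. Unlike the documented void method, execSearchPlan here takes extra parameters and returns a count so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class SearchPlanFlowSketch {
    /** Stand-in for SearchEngineInterface; the real search() returns a LinkRecordSet. */
    interface SearchEngine {
        List<String> search(String query, long pos);
    }

    /** Stand-in for DatabaseInterface; its real methods are not listed in the text. */
    interface LinkStore {
        void store(String engine, String query, String url);
    }

    /** Hypothetical plan row: engine name, query text, pages wanted, pages already processed. */
    static final class Plan {
        final String engine;
        final String queryText;
        final long pages;
        final long processed;

        Plan(String engine, String queryText, long pages, long processed) {
            this.engine = engine;
            this.queryText = queryText;
            this.pages = pages;
            this.processed = processed;
        }
    }

    /** Couples one search engine with one storage target, as LinkRecordStorage is described to do. */
    static final class LinkRecordStorage {
        private final String engineName;
        private final SearchEngine engine;
        private final LinkStore store;

        LinkRecordStorage(String engineName, SearchEngine engine, LinkStore store) {
            this.engineName = engineName;
            this.engine = engine;
            this.store = store;
        }

        /** Queries the given number of pages starting at page index from; returns the number of stored links. */
        int querySearchEngine(String query, long from, long pages) {
            int stored = 0;
            for (long page = from; page < from + pages; page++) {
                for (String url : engine.search(query, page)) {
                    store.store(engineName, query, url);
                    stored++;
                }
            }
            return stored;
        }
    }

    /** Walks the plan list and dispatches each record to the engine named in it. */
    static int execSearchPlan(LinkedList<Plan> searchPlan, Map<String, SearchEngine> engines, LinkStore store) {
        int total = 0;
        for (Plan plan : searchPlan) {
            SearchEngine engine = engines.get(plan.engine);
            if (engine == null) {
                continue; // no implementation registered under this engine name
            }
            LinkRecordStorage storage = new LinkRecordStorage(plan.engine, engine, store);
            total += storage.querySearchEngine(plan.queryText, plan.processed, plan.pages - plan.processed);
        }
        return total;
    }

    static List<String> demo() {
        // Fake engine: exactly one made-up URL per result page.
        SearchEngine fake = (query, pos) -> List.of("http://example.org/" + query + "/" + pos);
        Map<String, SearchEngine> engines = new HashMap<>();
        engines.put("yahoo", fake);
        engines.put("google", fake);

        List<String> stored = new ArrayList<>();
        LinkStore store = (engine, query, url) -> stored.add(engine + " " + url);

        LinkedList<Plan> plans = new LinkedList<>();
        plans.add(new Plan("yahoo", "job", 2, 0));
        plans.add(new Plan("google", "career", 3, 1));
        plans.add(new Plan("unknown", "work", 5, 0)); // skipped: engine not registered
        execSearchPlan(plans, engines, store);
        return stored;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because the storage object receives both implementations through its constructor, the same loop works for any combination of search engine and storage target.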

A.3.3      Enhancements and Optimizing

To extend RIDAR with an additional search engine, one should implement the SearchEngineInterface interface and its public LinkRecordSet search(String query, long pos) method.
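As a rough illustration of such an extension, the sketch below implements the search(String query, long pos) signature for a hypothetical in-memory engine. Only that signature comes from the text. The shapes of LinkRecord and LinkRecordSet, the page size of two and the interpretation of pos as a zero-based page index are assumptions made for the example. A real engine would call a web service API instead of scanning a list.

```java
import java.util.ArrayList;
import java.util.List;

public class CustomEngineSketch {
    /** Hypothetical shape of a single search hit. */
    static final class LinkRecord {
        final String title;
        final String url;

        LinkRecord(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    /** Hypothetical shape of one page of results. */
    static final class LinkRecordSet {
        final List<LinkRecord> records = new ArrayList<>();
    }

    /** Stand-in declared locally so the sketch compiles on its own. */
    interface SearchEngineInterface {
        LinkRecordSet search(String query, long pos);
    }

    /** Toy engine: matches the query against titles and serves the hits page by page. */
    static final class InMemoryEngine implements SearchEngineInterface {
        private static final int PAGE_SIZE = 2;
        private final List<LinkRecord> index;

        InMemoryEngine(List<LinkRecord> index) {
            this.index = index;
        }

        @Override
        public LinkRecordSet search(String query, long pos) {
            LinkRecordSet page = new LinkRecordSet();
            String needle = query.toLowerCase();
            long seen = 0;
            for (LinkRecord record : index) {
                if (!record.title.toLowerCase().contains(needle)) {
                    continue; // not a hit
                }
                if (seen++ < pos * PAGE_SIZE) {
                    continue; // hit belongs to an earlier page
                }
                if (page.records.size() == PAGE_SIZE) {
                    break; // requested page is full
                }
                page.records.add(record);
            }
            return page;
        }
    }

    static InMemoryEngine sampleEngine() {
        List<LinkRecord> index = new ArrayList<>();
        index.add(new LinkRecord("Java developer job", "http://example.org/1"));
        index.add(new LinkRecord("Cooking blog", "http://example.org/2"));
        index.add(new LinkRecord("Job board Bratislava", "http://example.org/3"));
        index.add(new LinkRecord("Tester job offer", "http://example.org/4"));
        return new InMemoryEngine(index);
    }

    public static void main(String[] args) {
        for (LinkRecord record : sampleEngine().search("job", 1).records) {
            System.out.println(record.title + " -> " + record.url);
        }
    }
}
```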

To extend RIDAR with an additional database storage engine, one should implement the DatabaseInterface interface and its methods (see above).
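A storage target does not have to be a relational database. The sketch below writes links to a generic character stream as tab-separated lines. The real DatabaseInterface methods are not listed in this text, so the storeLink() method, its parameters and the TabSeparatedStorage class are purely hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class FileStorageSketch {
    /** Hypothetical stand-in; the actual interface may declare different methods. */
    interface DatabaseInterface {
        void storeLink(String title, String url, String queryText, String engine);
    }

    /** Writes one tab-separated line per link to any Writer (a file, a buffer, a socket). */
    static final class TabSeparatedStorage implements DatabaseInterface {
        private final Writer out;

        TabSeparatedStorage(Writer out) {
            this.out = out;
        }

        @Override
        public void storeLink(String title, String url, String queryText, String engine) {
            try {
                out.write(String.join("\t", clean(title), clean(url), clean(queryText), clean(engine)));
                out.write("\n");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Replaces separator characters inside a field so each record stays on one line. */
        private static String clean(String field) {
            return field.replace('\t', ' ').replace('\n', ' ');
        }
    }

    static String demo() {
        StringWriter buffer = new StringWriter();
        DatabaseInterface storage = new TabSeparatedStorage(buffer);
        storage.storeLink("Job offers", "http://example.org/jobs", "job", "yahoo");
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```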

A.4          Manual for Adaptation to Other Domains

A.4.1      Configuring to Other Domain

Not relevant.

A.4.2      Dependencies

No dependencies.

 



[1] http://www.google.com/

[2] http://www.alltheweb.com/

[3] http://www.yahoo.com/

[4] http://code.google.com/

[5] http://developer.yahoo.com/