A RIDAR – Method for Relevant Internet Data Resource Identification
A.1 Basic Information
Information acquisition systems often need to identify primary Internet resources. RIDAR exploits existing search engines to retrieve links to relevant Internet resources based on user-supplied search terms or more complex search expressions. Details about the identified resources (URL, title, etc.) are stored in databases or sent to other tools for further processing.
Within the NAZOU project, RIDAR is used primarily to identify sources of job offers on the Internet. The method exploits existing search services to acquire links to potential sources of job offers. It thus utilizes the extensive space of keyword-indexed data sources covered by existing search engines such as Google[1], AllTheWeb[2] or Yahoo![3].
A.1.1 Basic Terms
API – Application Programming Interface
SOAP – Simple Object Access Protocol
WSDL – Web Service Description Language
A.1.2 Method Description
This method exploits existing search engines to identify relevant information resources on the Internet based on user-supplied search terms or more complex search expressions. RIDAR can integrate any search engine that exposes a web service API. Currently, the Google API[4] and the Yahoo! API[5] are supported and integrated.
RIDAR provides generic interfaces for integrating search engines and for plugging in targets that store the search results. Retrieved results can be stored in any target, such as a database or a generic file. Currently, the MySQL and NazouDB targets are implemented in RIDAR.
The method searches existing search engines using keywords or collocations that describe a particular domain. The tool integrates the web service APIs of the search engines. These APIs are implemented as web services based on SOAP and WSDL.
A.1.3 Scenarios of Use
RIDAR is suitable in two scenarios:
Before using the tool in either of the two scenarios mentioned above, one must gain access to the search engine APIs. Integration of a search engine follows these steps:
A license key (or application ID) must be supplied each time the API is accessed. Upon registration, one receives an authorization key that grants access to the search engine API. With a basic registration, the number of queries is limited for each license key. An extended or unlimited number of queries is provided on a paid basis. RIDAR can manage several authorization keys for different search engines.
The search engines Google and Yahoo! have been registered; a license key for Google and an application ID for Yahoo! were obtained solely for the purposes of the NAZOU project.
A.1.4 External Links and Publications
BALOGH Z.: RIDAR – RELEVANT INTERNET DATA RESOURCE IDENTIFICATION. In: Laclavik M. et al.: WIKT 2006 Proceedings, 1st Workshop on Intelligent and Knowledge-oriented Technologies, ISBN 978-80-969202-5-9, pp.122, 2007, Bratislava, Slovakia.
HLUCHÝ, Ladislav - ŠELENG, Martin - ORAVEC, V. - BUDINSKÁ, Ivana - LACLAVÍK, Michal - GATIAL, Emil - BALOGH, Zoltán - CIGLAN, Marek. Data transition chain. In HLUCHÝ, Ladislav. Tools for acquisition, organisation and presenting of information and knowledge : proceedings in informatics and information technologies. - Košice : Vydavateľstvo STU, Bratislava, 2007. ISBN 978-80-227-2716-7, part 2, P. 79-91.
GATIAL, Emil - BALOGH, Zoltán - HLUCHÝ, Ladislav - VOJTEK, Peter. Identification and acquisition of domain dependent internet resources. In HLUCHÝ, Ladislav. Tools for acquisition, organisation and presenting of information and knowledge : proceedings in informatics and information technologies. - Košice : Vydavateľstvo STU, Bratislava, 2007. ISBN 978-80-227-2716-7, part 2, P. 68-78.
GATIAL, Emil - BALOGH, Zoltán. Identifying, retrieving and determining relevance of heterogenous internet resources. In Research Project Workshop. Tools for acquisition, organization and presenting of information and knowledge : proceedings in informatics and information technology. - Bratislava : Slovak University of Technology Bratislava, 2006. ISBN 80-227-2468-8, s. 15-21.
A.2 Integration Manual
A.2.1 Dependencies
The following software is required in order to install RIDAR:
A.2.2 Installation
Herein we describe how to install the RIDAR tool:
$ NAZOU_HOME=/usr/local/nazou
$ NAZOU_SVN=/tmp
$ mkdir ${NAZOU_HOME}
$ mkdir ${NAZOU_SVN}/ridar
$ svn co https://nazou.fiit.stuba.sk/svn/nazou/ridar ${NAZOU_SVN}/ridar
$ cd ${NAZOU_SVN}/ridar/scripts
$ mysql -uridar -p**** < ./ridar.sql
where -u and -p specify the database user and password, which must be set up before importing the database structure (the mysql client expects the password immediately after -p, without a space). The SQL statements that create the RIDAR tables are in the ridar.sql file.
$ ${ANT_HOME}/bin/ant -DDEPLOY_DIR=${NAZOU_HOME}/ridar dist
where ${ANT_HOME} is the home directory of the Ant software suite.
At this stage you are ready to use the RIDAR tool.
A.2.3 Configuration
Configuration only requires setting the database name (dbname), user (dbuser) and password (dbpass) in the ${NAZOU_HOME}/ridar/config.properties file.
Default values are set in the config.properties file in the ${NAZOU_HOME}/ridar/src/nazou/ridar/ directory:
dbname=nazou-ridar
dbuser=nazou
dbpass=*****
Configuration also involves defining search plans. A plan is a database record that specifies which search engine should be queried, with which search expression, and how many result pages RIDAR should retrieve and store. The search expression syntax may differ between search engines, so the user must make sure that each inserted search expression is valid for the specified search engine. The search plan table has the following structure:
Field        Type
id_schedule  bigint(11)
engine       varchar(32) …
query_text   varchar(255)
pages        bigint(20)
processed    bigint(20)
An example entry is the following:
id_schedule  engine  query_text                             pages  processed
1            yahoo   job                                    6      0
2            google  career                                 3      0
4            google  site:www.profesia.sk list_offers.php3  2      0
The processed field is incremented upon every search engine query and retrieval of a new query result page.
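To illustrate how the pages and processed fields interact, the following minimal Java sketch models a single plan record. Only the field names come from the table above; the class name PlanRecord and all of its methods are hypothetical, and the sketch assumes that processed counts the result pages already retrieved, so that a plan is finished once processed reaches pages.

```java
// Hypothetical model of one search plan row; field names follow the plan table.
final class PlanRecord {
    final long idSchedule;
    final String engine;
    final String queryText;
    final long pages;        // number of result pages requested
    private long processed;  // number of result pages retrieved so far

    PlanRecord(long idSchedule, String engine, String queryText,
               long pages, long processed) {
        this.idSchedule = idSchedule;
        this.engine = engine;
        this.queryText = queryText;
        this.pages = pages;
        this.processed = processed;
    }

    // Assumption: the plan is complete once every requested page was retrieved.
    boolean isComplete() {
        return processed >= pages;
    }

    // Number of pages that still have to be fetched (never negative).
    long remainingPages() {
        return Math.max(0L, pages - processed);
    }

    // Called after each successfully retrieved result page.
    void markPageProcessed() {
        processed++;
    }

    long getProcessed() {
        return processed;
    }
}
```

Under these assumptions, the first example row (yahoo, job, 6 pages) would be constructed as new PlanRecord(1, "yahoo", "job", 6, 0) and would report six remaining pages until markPageProcessed() has been called six times.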
A.2.4 Integration Guide
This section describes how to use the tool. Usage consists of two steps: first, the definition of search terms and expressions, and second, the execution of the RIDAR tool.
Search terms are stored in a database to allow planning and scheduling of individual search executions.
RIDAR is invoked by executing its main Java class. Because several required JAR files must be on the classpath for RIDAR to run, the SVN repository contains a dedicated script for invoking RIDAR:
$ ${NAZOU_HOME}/ridar/scripts/run_ridar.sh
The content of this script is the following:
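#!/bin/bash
# Build the classpath from all JAR files in the RIDAR lib directory.
JAR=.
for jar in "${NAZOU_HOME}"/ridar/lib/*.jar
do
    JAR="${JAR}:${jar}"
done
# Run from the RIDAR home directory; abort if it does not exist.
cd "${NAZOU_HOME}/ridar" || exit 1
"${JAVA_HOME}/bin/java" \
    -classpath "${JAR}" \
    nazou.ridar.core.Ridar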
This script simply reads in all relevant libraries from the ${NAZOU_HOME}/ridar/lib directory and executes the main nazou.ridar.core.Ridar class.
On some systems it makes sense to execute RIDAR on a regular basis. This can be achieved by adding a record to the /etc/crontab file, which results in periodic execution by the cron daemon:
00 * * * * nazou nice -n 19 ${NAZOU_HOME}/ridar/scripts/run_ridar.sh
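This record in the crontab file causes RIDAR to be executed every hour, on the hour. Because cron runs jobs with a minimal environment, NAZOU_HOME and JAVA_HOME must be defined in the crontab file (or in the script itself) for this record to work.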
A.3.1 Tool Structure
All packages are organized under the top-level nazou.ridar package (referred to as n.r in the diagrams). The tool is organized into the following packages:
Figure 1: Organization of packages.
The main class is the n.r.core.Ridar class. The main() method of this class is invoked upon RIDAR tool execution. The class has the following dependencies:
Figure 2: The Ridar class and its dependencies
The next important class is the n.r.core.LinkRecordStorage class. When instantiating this class, we must specify implementations of both the search engine interface and the database storage interface:
Figure 3: LinkRecordStorage class and its dependencies.
This class queries the specified search engine and stores the retrieved results through the specified database storage. The table in which the retrieved links are stored has the following structure:
Field        Type
id           bigint(20)
title        varchar(255)
url          text
insert_date  timestamp
query_text   varchar(255)
engine       text
webcrawler   tinyint(1)
As for the database interface there are two basic implementations of the n.r.db.DatabaseInterface interface:
Figure 4: MySQL and NazouCM are the two implementations of the DatabaseInterface.
As for the search engine interface there are also two basic implementations of the n.r.searchengines.SearchEngineInterface interface:
Figure 5: Google and Yahoo are the two implementations of interface SearchEngineInterface.
The package n.r.cocoon contains a class that was used to demonstrate the RIDAR tool during the first evaluation of the project. The package n.r.results holds two classes that represent search query results and pass them between individual classes.
A.3.2 Method Implementation
Upon execution, the tool creates a new instance of the Ridar class and invokes its run() method. Inside the run() method, a search plan is retrieved from a MySQL database. If the search plan is retrieved successfully, it is executed by calling the void execSearchPlan(LinkedList<Plan> searchPlan) method. This method traverses each plan record in the search plan list and, based on the search engine specified in the record, executes the appropriate search engine query. Each search engine is queried via the int querySearchEngine(String query, long from, long pages) method of the LinkRecordStorage class. A LinkRecordStorage instance is initialized with the selected SearchEngineInterface and DatabaseInterface implementations.
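The control flow described above can be sketched in Java as follows. This is a simplified illustration, not the actual RIDAR source: the method signatures execSearchPlan, querySearchEngine and search are taken from the text, while the class RidarSketch, the register() method, the store() method of DatabaseInterface, the fields of Plan, LinkRecord and LinkRecordSet, and the interpretation of from and pos as zero-based result page indexes are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the RIDAR classes of the same name.
class LinkRecord {
    final String title;
    final String url;
    LinkRecord(String title, String url) { this.title = title; this.url = url; }
}

class LinkRecordSet {
    final List<LinkRecord> records = new ArrayList<LinkRecord>();
}

interface SearchEngineInterface {
    // Signature from the text; 'pos' is assumed to be a zero-based page index.
    LinkRecordSet search(String query, long pos);
}

interface DatabaseInterface {
    // Hypothetical method; the real interface methods are not listed here.
    void store(LinkRecord record, String queryText, String engine);
}

class Plan {
    final String engine;
    final String queryText;
    final long pages;
    long processed;
    Plan(String engine, String queryText, long pages, long processed) {
        this.engine = engine;
        this.queryText = queryText;
        this.pages = pages;
        this.processed = processed;
    }
}

class LinkRecordStorage {
    private final String engineName;
    private final SearchEngineInterface engine;
    private final DatabaseInterface db;

    LinkRecordStorage(String engineName, SearchEngineInterface engine, DatabaseInterface db) {
        this.engineName = engineName;
        this.engine = engine;
        this.db = db;
    }

    // Fetches 'pages' result pages starting at page 'from' and stores every link.
    // Assumption: the return value is the number of links stored.
    int querySearchEngine(String query, long from, long pages) {
        int stored = 0;
        for (long page = from; page < from + pages; page++) {
            for (LinkRecord record : engine.search(query, page).records) {
                db.store(record, query, engineName);
                stored++;
            }
        }
        return stored;
    }
}

class RidarSketch {
    private final Map<String, LinkRecordStorage> storages =
            new HashMap<String, LinkRecordStorage>();

    // Hypothetical helper that binds an engine name to its storage pipeline.
    void register(String engineName, SearchEngineInterface engine, DatabaseInterface db) {
        storages.put(engineName, new LinkRecordStorage(engineName, engine, db));
    }

    void execSearchPlan(LinkedList<Plan> searchPlan) {
        for (Plan plan : searchPlan) {
            LinkRecordStorage storage = storages.get(plan.engine);
            if (storage == null) {
                continue; // unknown engine: skip this plan record
            }
            // Fetch one page at a time so 'processed' advances per retrieved page.
            while (plan.processed < plan.pages) {
                storage.querySearchEngine(plan.queryText, plan.processed, 1);
                plan.processed++;
            }
        }
    }
}
```

Fetching one page per call keeps the processed counter accurate if a run is interrupted, so a later run would continue where the previous one stopped. Whether RIDAR itself resumes in this way is not stated in this document.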
A.3.3 Enhancements and Optimizing
To extend RIDAR with an additional search engine, one should implement the SearchEngineInterface interface and its public LinkRecordSet search(String query, long pos) method.
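As a rough illustration of such an extension, the sketch below implements the search method for a hypothetical engine that serves results from a fixed in-memory list, which can be handy for testing without a license key. Only the search signature comes from the text; the class FixedListSearchEngine, the simplified LinkRecord and LinkRecordSet types, the page size of 10 and the reading of pos as a zero-based page index are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the RIDAR result classes.
class LinkRecord {
    final String title;
    final String url;
    LinkRecord(String title, String url) { this.title = title; this.url = url; }
}

class LinkRecordSet {
    final List<LinkRecord> records = new ArrayList<LinkRecord>();
}

interface SearchEngineInterface {
    LinkRecordSet search(String query, long pos);
}

// Hypothetical engine that matches the query against titles in a fixed list.
class FixedListSearchEngine implements SearchEngineInterface {
    private static final int PAGE_SIZE = 10; // assumed number of results per page
    private final List<LinkRecord> corpus;

    FixedListSearchEngine(List<LinkRecord> corpus) {
        this.corpus = new ArrayList<LinkRecord>(corpus);
    }

    @Override
    public LinkRecordSet search(String query, long pos) {
        LinkRecordSet page = new LinkRecordSet();
        String needle = query.toLowerCase();
        long toSkip = pos * PAGE_SIZE; // matches that belong to earlier pages
        long skipped = 0;
        for (LinkRecord record : corpus) {
            if (!record.title.toLowerCase().contains(needle)) {
                continue; // not a match for this query
            }
            if (skipped < toSkip) {
                skipped++;
                continue;
            }
            if (page.records.size() == PAGE_SIZE) {
                break; // current page is full
            }
            page.records.add(record);
        }
        return page;
    }
}
```

An implementation for a real search service would replace the loop with a call to that service's web API and map each returned hit to a LinkRecord.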
To extend RIDAR with an additional database storage engine, one should implement the DatabaseInterface interface and its methods (see above).
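A storage extension could look roughly like the following in-memory sketch. Since the methods of the database interface are not listed in this document, the storeLink method shown here is purely hypothetical, as are the classes LinkRow and InMemoryDatabase. The fields of LinkRow mirror the columns of the link table described earlier, and the sketch assumes that webcrawler is a flag that is initially false for newly stored links.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;

// One stored link; fields mirror the columns of the link table.
class LinkRow {
    final long id;
    final String title;
    final String url;
    final Date insertDate;
    final String queryText;
    final String engine;
    final boolean webcrawler; // assumed: false until a crawler handles the link

    LinkRow(long id, String title, String url, Date insertDate,
            String queryText, String engine, boolean webcrawler) {
        this.id = id;
        this.title = title;
        this.url = url;
        this.insertDate = insertDate;
        this.queryText = queryText;
        this.engine = engine;
        this.webcrawler = webcrawler;
    }
}

interface DatabaseInterface {
    // Hypothetical method: stores one link and returns its generated id.
    long storeLink(String title, String url, String queryText, String engine);
}

// Hypothetical storage target that keeps all rows in memory.
class InMemoryDatabase implements DatabaseInterface {
    private final List<LinkRow> rows = new ArrayList<LinkRow>();
    private long nextId = 1;

    @Override
    public synchronized long storeLink(String title, String url,
                                       String queryText, String engine) {
        LinkRow row = new LinkRow(nextId++, title, url, new Date(),
                                  queryText, engine, false);
        rows.add(row);
        return row.id;
    }

    // Returns a read-only snapshot of the stored rows.
    synchronized List<LinkRow> rows() {
        return Collections.unmodifiableList(new ArrayList<LinkRow>(rows));
    }
}
```

A file-based or relational target would follow the same shape, replacing the list with writes to the chosen backend.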
A.4 Manual for Adaptation to Other Domains
A.4.1 Configuring to Other Domain
Not relevant.
A.4.2 Dependencies
No dependencies.