A.1 Basic Information
Corporate memory (CM), developed within the NAZOU project, serves as a tool for exchanging data, information and knowledge among the components of the system. CM plays a crucial role in the process of knowledge system development. It performs two basic tasks: storing data, information and knowledge, and keeping them updated and maintained. Data are stored in CM either in their source form (e.g. HTML, PDF) or in a processed form, such as ontologically organized data or data in a database system.
This document describes the CM software package for access to databases and file storage.
A.1.1 Basic Terms
WS – Web Service
A.1.2 Method Description
CM provides transparent access to (possibly distributed) data management systems that serve as the data persistence layer for the knowledge acquisition and maintenance tools developed within the NAZOU project.
Corporate memory is organized into three layers (see Figure 1):
· Physical layer, which contains a standard file system, a database system, and ontological models.
· Manipulation layer, which provides other components of the knowledge system with access to the stored data and information. This covers storing knowledge and information in the physical layer, as well as indexing, annotating, organizing and maintaining the stored data before they are presented.
· Interaction layer, which is responsible for the interaction of CM with other components of the system.
The architecture of CM respects the requirements imposed by the distributed character of the knowledge system.
Other components access Corporate Memory through the relevant client; each part of CM has a client implementation. The core of CM runs as a server, and other components call the relevant client methods via a predefined interface. CM allows both local and remote access to the data management systems. The access method (local or remote) is transparent to the application using CM and is defined through the CM configuration.
Figure 1: Corporate Memory architecture
The part of corporate memory dedicated to file management provides a unified application interface for manipulating the file storage, making the actual physical file storage transparent to the user or application. This virtualization allows information and knowledge management applications to access file storage in a uniform way, whether the storage resides on the same computing device or is hosted as a service on another computing resource.
In the current implementation, the file storage of CM is realized as a directory subtree of a file system directory tree. The file management operations provided by CM are listing, copying, moving and deleting files and directories in CM. Files can be inserted into CM from file input streams, by specifying a local file path, or by specifying the URL of a document. Files can be retrieved from CM as an input stream or saved to a given location in the local file system.
The relational database management part of CM was also designed with the virtualization concept in mind, making the actual database system and database connection object transparent to client applications. The advantage of this approach is that, from the application's point of view, distributed resources are accessed in the same way as a local database system. This allows the CM implementation to be used on a single computer as well as on a set of servers with clearly separated functionality (e.g. database server, file storage server, ontology management server, set of application servers).
The RDB part of CM provides an interface for executing basic SQL queries, as well as interfaces for executing predefined queries over common database structures. In the pilot implementation, the common database structures of CM store data about documents that serve as the input data for information and knowledge management applications.
Application developers can provide their own predefined queries for application-specific relational data. New predefined queries can be plugged into CM, provided they satisfy the CM RDB query interface.
Like the file storage part, the RDB part of CM is accessible through the local Java API, XML-RPC calls, or the Web Service interface. The Web Service interface to the CM relational database part is realized by the OGSA-DAI framework.
A.1.3 Scenarios of Use
CM serves as a data-centric integration platform for information and knowledge processing tools. CM provides transparent access to the data management systems. A developer can use the CM client classes to gain access to the required data management systems. When an application tool accesses a data management system (relational database, file storage) through CM, only proper configuration of CM is required to deploy and use the tool; no application-specific configuration for data access is necessary.
In the following, we describe an example of how NAZOU components work with information and knowledge stored in CM. The data transformation chain is depicted in Figure 2. At each stage, information or knowledge about files and offers is accumulated.
The following tools and components transform, generate and manipulate data, information and knowledge in CM, in the listed order:
· RIDAR (Relevant Internet Data Resource Identification) connects to existing search engines and identifies relevant web resources.
· WebCrawler and ERID (Estimate Relevance of Internet Documents) recursively explore web resources and store
· DocConverter transforms documents to TXT format.
· OSID (Offer Separation for Internet Documents) extracts offers (e.g. job offers) from a document. Whether a document contains several offers or only one, OSID selects each offer without the page header, footer, menu, banners and other content unrelated to the offer.
· RFTS and JDBSearch index text documents and offers; this allows other tools (searching, clustering) to use the indexes for further processing.
· Ontea (Ontology based text annotation) annotates the text version of offers with ontology individuals, which are detected via regular expressions as relevant semantic properties of the offer. Ontea thus creates an ontological form of each offer from its file version, according to the defined domain ontology.
· The Prescott tool and the faceted browser support presentation, which transforms ontological data to XML; the XML is further transformed to HTML via XSL. Indexes, discovered clusters and the JobClusterNavigator tool are also used by the presentation to search, categorize and navigate the offers accumulated in CM.
The example shows a chain of almost independent tools integrated around the proposed corporate memory. The memory works with three types of data (files, relational data and semantic data), while the fundamental conversions between data types and formats are carried out by the chain of independent tools.
A.1.4 External Links and Publications
· CIGLAN M., BABÍK M., LACLAVÍK M., BUDÍNSKA I., HLUCHÝ L.: Corporate Memory: A framework for supporting tools for acquisition, organization and maintenance of information and knowledge, Proceedings of 9th International Conference on Information Systems Implementation and Modelling (ISIM '06), April, Přerov, Czech Republic
· HLUCHÝ L., BUDINSKÁ I., NGUYEN G., LACLAVÍK M., BABÍK M., CIGLAN M., GATIAL E., BALOGH Z., and ORAVEC V.: Corporate Memory as a Framework for Data Oriented Integration. In: Tools for Acquisition, Organisation and Presenting of Information and Knowledge. P.Navrat et al. (Eds.), Vydavatelstvo STU, Bratislava, 2006, pp.231-238, ISBN 80-227-2468-8. Workshop 26-28 September, Nizke Tatry, Slovakia.
A.2 Integration Manual
Corporate Memory is developed in Java (Standard Edition 5) and distributed as a source code archive with an automatic build script. Access to the functionality of the tool is provided through a Java interface. The optional Web Service interface of the presentation layer requires a web service container (such as the GT4 container or Apache Tomcat).
The core component of Corporate Memory uses the following libraries:
· Log4J – logging utility (http://logging.apache.org/)
· JUnit – Java unit test library (http://www.junit.org/)
· Xerces – XML processing library (http://xerces.apache.org/xerces-j/)
· Apache Ant – Java build tool (http://ant.apache.org/)
· JDBC database drivers for supported relational databases
The Web Service interfaces of the presentation layer use the following software:
· GT4 core – Web Service Java core of Globus Toolkit 4
· OGSA-DAI – Data access and integration service (http://www.ogsadai.org.uk/)
· MEDIGRID Data Transfer Service (http://ups.savba.sk/medigrid)
The environment variable NAZOU_HOME has to be set and must point to a valid directory in the hosting file system. (On Linux-based systems, the environment variable can be set by executing the command 'export NAZOU_HOME=<path_to_installation_directory>'.)
Unzip Corporate Memory distribution 'CorporateMemory.zip'.
Corporate Memory uses Apache Ant for building and deploying the software. Apache Ant has to be installed and configured on the hosting system.
Build and deploy Corporate Memory using the command 'ant deploy' in the CM distribution directory. The script will compile CM and deploy it to the $NAZOU_HOME directory.
Installation of Web Service Interfaces
· Install Globus Toolkit 4 Java core (package and deployment instructions can be found at (footnote 5))
· Install OGSA-DAI data service version 2.2 (package and deployment instructions can be found at http://www.ogsadai.org.uk/). Deploy (according to the instructions provided in OGSA-DAI documentation) data resources for each relational data resource exposed by CM.
· Unzip Corporate Memory distribution 'CorporateMemeory-WSInterface.zip'.
· Build and deploy Corporate Memory WS interface to storage system using command 'ant deploy' in CM WSInterface distribution directory. The script will compile and deploy CM to $NAZOU_HOME directory.
After installation, CM has to be configured. Several configuration files are deployed in the directory $NAZOU_HOME/conf/CorporateMemory.
After deployment, all configuration files are prefixed with the character '_'. The files must be renamed by removing the '_' prefix (e.g. rename _cmFileStorage.properties to cmFileStorage.properties). The prefix prevents an existing configuration from being overwritten when CM is re-deployed or a new version of CM is deployed.
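The renaming step can also be scripted. The following is a minimal sketch and not part of the CM distribution: the class name ConfigActivator is hypothetical, and it relies on the java.nio.file API, which is newer than the Java SE 5 platform that CM itself targets. It strips the '_' prefix from each deployed file, but skips any file whose unprefixed name already exists, so an existing configuration is left untouched.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the CM distribution.
final class ConfigActivator {

    // Renames each regular file "_name" in confDir to "name", unless "name"
    // already exists. Returns the new names of the files that were renamed.
    static List<String> activate(Path confDir) {
        List<String> activated = new ArrayList<>();
        try {
            // Collect the candidates first so that renaming does not
            // interfere with the directory iteration.
            List<Path> prefixed = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(confDir, "_*")) {
                for (Path file : files) {
                    prefixed.add(file);
                }
            }
            for (Path source : prefixed) {
                String targetName = source.getFileName().toString().substring(1);
                Path target = confDir.resolve(targetName);
                if (targetName.isEmpty() || !Files.isRegularFile(source) || Files.exists(target)) {
                    continue; // nothing to rename, or an existing configuration is kept
                }
                Files.move(source, target);
                activated.add(targetName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return activated;
    }
}
```

Calling ConfigActivator.activate on the $NAZOU_HOME/conf/CorporateMemory directory would process the files described above; files that were skipped keep their '_' prefix and can be compared with the existing configuration by hand.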
This configuration file specifies the connector and endpoint of the server part of Corporate Memory. The properties 'filesInteractionMode' and 'rdbInteractionMode' define the interaction method for accessing CM. The *Impl properties define the Java classes that implement the interaction method. The 'filesEndPoint' and 'rdbEndPoint' properties specify the endpoint addresses of the CM services; they are used for remote access to CM.
This is the only configuration file required for the client part of CM.
Example configuration for accessing local data management systems:
Example configuration for accessing remote data management systems via WS interface:
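How a client might read and check these properties can be sketched as follows. This is an illustration only: the class ClientConfigSketch is hypothetical, the mode value 'local' is an assumed placeholder (the values actually accepted by CM are not listed in this manual), and the *Impl properties are left out. The sketch requires an endpoint only when the corresponding interaction mode is not local.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical reader for the client configuration; not the CM implementation.
final class ClientConfigSketch {
    final String filesInteractionMode;
    final String rdbInteractionMode;
    final String filesEndPoint; // null when the file storage is accessed locally
    final String rdbEndPoint;   // null when the database is accessed locally

    private ClientConfigSketch(Properties p) {
        filesInteractionMode = required(p, "filesInteractionMode");
        rdbInteractionMode = required(p, "rdbInteractionMode");
        filesEndPoint = endPointFor(p, filesInteractionMode, "filesEndPoint");
        rdbEndPoint = endPointFor(p, rdbInteractionMode, "rdbEndPoint");
    }

    // Parses the text of a properties file.
    static ClientConfigSketch parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ClientConfigSketch(p);
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    // ASSUMPTION: the value "local" marks local access; any other mode is
    // treated as remote and therefore needs an endpoint address.
    private static String endPointFor(Properties p, String mode, String key) {
        return "local".equals(mode) ? null : required(p, key);
    }
}
```

Validating the endpoint at load time means a misconfigured remote client fails when it starts, rather than on its first call to CM.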
This configuration file is required for the server part of CM.
This configuration file is used by the file system access part of CM. The <dataResource> element must be configured. It specifies the name of the file system resource within CM and the root directory of the part of the file system handled by CM.
For example, if the files handled by Corporate Memory should be stored in the directory /opt/nazou/CM on the storage server, the configuration will be:
This means that, for example, a file stored on the storage server's file system as /opt/nazou/CM/a.txt will be accessible through CM as /a.txt.
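The mapping between CM paths and physical paths can be sketched in a few lines. The class CmPathMapper below is hypothetical and only illustrates the idea of a root directory; it uses java.nio.file, which is newer than the Java SE 5 platform CM targets, and the check against paths that escape the root is an added assumption rather than documented CM behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of root-relative path mapping; not CM code.
final class CmPathMapper {
    private final Path root;

    CmPathMapper(String rootDirectory) {
        this.root = Paths.get(rootDirectory).toAbsolutePath().normalize();
    }

    // Maps a CM path such as "/a.txt" to the physical path below the root.
    Path toPhysical(String cmPath) {
        String relative = cmPath.replaceFirst("^/+", "");
        Path resolved = root.resolve(relative).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("path escapes the CM root: " + cmPath);
        }
        return resolved;
    }

    // Maps a physical path below the root back to its CM path.
    String toCmPath(Path physical) {
        Path normalized = physical.toAbsolutePath().normalize();
        if (!normalized.startsWith(root)) {
            throw new IllegalArgumentException("path is outside the CM root: " + physical);
        }
        return "/" + root.relativize(normalized).toString().replace('\\', '/');
    }
}
```

With the root /opt/nazou/CM, toPhysical("/a.txt") yields the physical path /opt/nazou/CM/a.txt, and a request such as "/../secret.txt" is rejected because it would leave the subtree handled by CM.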
This configuration file contains information for accessing relational database resources.
Example of a configuration file for MySQL database:
<property name="driver" value="com.mysql.jdbc.Driver" />
<property name="uri" value="jdbc:mysql://<host>:<port>/<db_schema>" />
<property name="user" value="<db_user_name>" />
<property name="passwd" value="<db_password>"/>
<property name="mediator" value="nazou.cm.core.db.MYSQLMediator" />
The rdbResource element specifies the name of the resource that identifies the database within CM (<CM_RESOURCE_NAME>). A client application uses this identifier to initialize the CM client classes for access to the relational data stored in the database. The driver property specifies the JDBC driver implementation used to create the database connections, the uri property specifies the connection string of the database, and the user and passwd properties specify the database user used to create the connections. The mediator property specifies the implementation of the CM SQL mediator class used to create and handle the database connections.
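To make the structure of this file concrete, the sketch below reads the property elements into a map and checks that the five properties named above are present. It is not the CM parser: the class RdbResourceConfigSketch is hypothetical, and it assumes only that the property elements sit inside a single root element. Note that the placeholders in the fragment above (such as <host>) must be replaced by real values before the file is parsed, because '<' is not allowed inside an XML attribute value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the <property name="..." value="..."/> elements.
final class RdbResourceConfigSketch {
    private static final String[] REQUIRED = {"driver", "uri", "user", "passwd", "mediator"};

    // Returns the properties in document order; fails if one is missing.
    static Map<String, String> readProperties(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                result.put(property.getAttribute("name"), property.getAttribute("value"));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse resource configuration", e);
        }
        for (String key : REQUIRED) {
            if (!result.containsKey(key)) {
                throw new IllegalArgumentException("missing property: " + key);
            }
        }
        return result;
    }
}
```

The resulting driver, uri, user and passwd values are the ones a mediator would pass to the standard JDBC calls Class.forName and DriverManager.getConnection.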
Configuration of WS Interface for file storage
Copy DataResourceConfig.xml to $GLOBUS_LOCATION/etc/DataTransferService.
A.2.4 Integration Guide
This section describes how to integrate Corporate Memory with applications using the Java API.
First, the CM client has to be created:
This operation creates a new CM client using the configuration file $NAZOU_HOME/conf/CorporateMemory/cmFileStorage.properties. If a different configuration is required, it can be specified in the constructor:
An instance of the CMClient class provides access to the CM file storage client and the CM database client:
CM File Storage Client Methods (CMFileStorageClient):
To use the CM File Storage Client, the storage resource must be specified first. Once a storage resource available through CM is set active, file manipulation and access methods can be used to work with files stored in the resource.
List available storage resources:
Set active storage resource:
File manipulation methods:
Make new directory:
List files in directory:
Check, if file exists:
Get file checksum:
File access methods:
Insert new file to CM storage from an input stream:
Retrieve file from CM storage
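The workflow listed above (choose a storage resource, then manipulate and access files) can be illustrated with a small directory-backed stand-in. Everything in the sketch is hypothetical: the class LocalFileStorageSketch and its method names are not the CMFileStorageClient API, the use of SHA-256 for the checksum is an assumption, and the code uses java.nio.file and streams, which are newer than the Java SE 5 platform CM targets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local stand-in; method names are NOT the CM client API.
final class LocalFileStorageSketch {
    private final Map<String, Path> resources = new TreeMap<>();
    private Path activeRoot;

    void addResource(String name, Path root) {
        resources.put(name, root.toAbsolutePath().normalize());
    }

    List<String> listResources() {
        return new ArrayList<>(resources.keySet());
    }

    void setActiveResource(String name) {
        Path root = resources.get(name);
        if (root == null) throw new IllegalArgumentException("unknown storage resource: " + name);
        activeRoot = root;
    }

    void makeDirectory(String cmPath) {
        try { Files.createDirectories(resolve(cmPath)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    List<String> list(String cmDirectory) {
        try (Stream<Path> entries = Files.list(resolve(cmDirectory))) {
            return entries.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    boolean exists(String cmPath) {
        return Files.exists(resolve(cmPath));
    }

    // Stores the stream content as a new file; fails if the file already exists.
    void insert(String cmPath, InputStream content) {
        try { Files.copy(content, resolve(cmPath)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    InputStream retrieve(String cmPath) {
        try { return Files.newInputStream(resolve(cmPath)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // ASSUMPTION: SHA-256, hex encoded; the algorithm CM uses is not stated.
    String checksum(String cmPath) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(resolve(cmPath)));
            return String.format("%064x", new BigInteger(1, digest));
        } catch (IOException e) { throw new UncheckedIOException(e); }
        catch (NoSuchAlgorithmException e) { throw new IllegalStateException(e); }
    }

    // Resolves a CM path against the active resource root.
    private Path resolve(String cmPath) {
        if (activeRoot == null) throw new IllegalStateException("no active storage resource");
        Path resolved = activeRoot.resolve(cmPath.replaceFirst("^/+", "")).normalize();
        if (!resolved.startsWith(activeRoot)) throw new IllegalArgumentException("path escapes the resource: " + cmPath);
        return resolved;
    }
}
```

A typical sequence would be addResource and setActiveResource, followed by makeDirectory("/offers"), insert("/offers/a.txt", stream) and retrieve("/offers/a.txt"). Calling any file method before a resource is active fails, which mirrors the rule that the storage resource must be specified first.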
CM Database Client Methods (CMDBClient):
To use the CM Database Client, the database resource must be specified first. Once a database resource available through CM is set active, database manipulation and access methods can be used.
List available database resources:
Set active database resource:
Database manipulation methods:
SQL query statements:
SQL update statements:
Developers can define and implement complex, application-specific queries using the template query capabilities of CM. This simplifies complex data access operations in applications using CM: the logic of the data access and manipulation is defined in the CM template query, and the application uses the template query to perform complex operations over the database resource.
First, the template query must be prepared; the user specifies the class of the template query, which implements the RdbmsTemplateQuery interface:
Parameters of the template query are specified as follows:
Finally, the template query is performed:
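As an illustration of the template query idea (prepare, set parameters, perform), the sketch below separates a named-parameter SQL template from the values supplied by the application. It is not the RdbmsTemplateQuery interface, whose methods are not listed in this manual: the names TemplateQuerySketch and NamedParameterQuery, the ':name' parameter syntax and the table used in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical interface; NOT the RdbmsTemplateQuery interface of CM.
interface TemplateQuerySketch {
    void setParameter(String name, Object value);
    String sql();             // SQL text with '?' placeholders
    List<Object> arguments(); // parameter values in placeholder order
}

// A template whose parameters are written as ":name" in the SQL text.
// Limitation: the pattern also matches ":name" inside SQL string literals.
final class NamedParameterQuery implements TemplateQuerySketch {
    private static final Pattern PARAM = Pattern.compile(":([A-Za-z_][A-Za-z0-9_]*)");
    private final String template;
    private final Map<String, Object> values = new HashMap<>();

    NamedParameterQuery(String template) {
        this.template = template;
    }

    public void setParameter(String name, Object value) {
        values.put(name, value);
    }

    public String sql() {
        return PARAM.matcher(template).replaceAll("?");
    }

    public List<Object> arguments() {
        List<Object> ordered = new ArrayList<>();
        Matcher matcher = PARAM.matcher(template);
        while (matcher.find()) {
            String name = matcher.group(1);
            if (!values.containsKey(name)) {
                throw new IllegalStateException("parameter not set: " + name);
            }
            ordered.add(values.get(name));
        }
        return ordered;
    }
}
```

For a template such as "SELECT url FROM documents WHERE format = :format", the application only calls setParameter("format", "txt"); the resulting SQL and argument list can then be bound to a java.sql.PreparedStatement with setObject, so the query logic stays inside the template class.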
If the CM client uses a local connection to the database system (not a remote one, e.g. WS access), the application can retrieve the JDBC connection:
A.3 Development Manual
This section describes the internal structure of CM and the method implementations.
A.3.1 Tool Structure
The CM tool is divided into four packages: the CM core package (server part of CM), the CM client package, the configuration package and the test package.
The configuration package (nazou.cm.conf) contains the classes that work with CM configuration files. The current implementation provides classes for property-style and XML-based configuration files.
The test package (nazou.cm.test) comprises JUnit tests for the client and core components of CM.
The CM core package (nazou.cm.core) contains the server-side components of CM. It is divided into two subpackages: file storage (nazou.cm.core.files) and database access (nazou.cm.core.db).
The CM client package (nazou.cm.client) contains the implementation of the client classes for access to the file storage and database parts of CM. A separate client class must be provided for each access method to CM (e.g. different client implementations are provided for local access and remote WS-based access).
A.3.2 Method Implementation
This section describes the implementation of the CM methods.
CM core components
CM database core component:
The principal class of this component is CMRdbStorage, which maintains the list of database resources configured for the CM server part (in the rdmResources.xml configuration file). For each resource, an instance of SqlMediator is kept. When a CM client invokes an operation, the SqlMediator of the appropriate resource is selected and the operation is executed over that resource.
The SqlMediator interface (Figure 3) specifies the methods of the CM classes that enable access to relational databases. The current distribution provides two implementations of the SqlMediator interface: mediators for MySQL (MYSQLMediator class) and HSQLDB (HSQLMediator class) databases.
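The dispatch described above, one mediator per configured resource selected by resource name, can be sketched as a simple registry. The names MediatorSketch and RdbStorageSketch are hypothetical, and the single query method stands in for the real SqlMediator methods, which are defined in Figure 3 and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-resource mediator; NOT the SqlMediator interface.
interface MediatorSketch {
    List<String> query(String sql);
}

// Keeps one mediator per resource name and routes each operation to it.
final class RdbStorageSketch {
    private final Map<String, MediatorSketch> mediators = new LinkedHashMap<>();

    void register(String resourceName, MediatorSketch mediator) {
        mediators.put(resourceName, mediator);
    }

    List<String> resourceNames() {
        return new ArrayList<>(mediators.keySet());
    }

    List<String> query(String resourceName, String sql) {
        MediatorSketch mediator = mediators.get(resourceName);
        if (mediator == null) {
            throw new IllegalArgumentException("unknown database resource: " + resourceName);
        }
        return mediator.query(sql);
    }
}
```

Because callers address a resource only by name, a database can be moved to another system by registering a different mediator under the same name, without changes on the client side.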
Another class in the package is SqlPump, which provides the capability of replicating data and data structures between different relational resources of CM.
CM file storage component:
The main class of the package, CMFilesStorage, is responsible for server-side data manipulation and for streaming data to and from clients for data access purposes.
The functionality of the CM core components is exposed via a local Java API for local access and through a WS-based interface. The WS for database access is provided by the OGSA-DAI service, while the WS for file storage is realized using customized versions of the MEDIGRID Data Transfer and MEDIGRID Resource Manager services.
CM client components
CM database client:
The interface for CM database clients is defined in the CMDBClient interface. When a new interaction method with CM is required (currently local access and remote access via WS are provided), a new client implementation has to be provided.
The database client package also contains a subpackage for template query definitions (nazou.cm.client.db.templateQueries). When a new template query is required, a new class implementing the template query interface (RdbmsTemplateQuery) has to be provided.
CM file storage client:
The interface for CM file storage clients is defined in the CMFileStorageClient interface. When a new interaction method with CM is required (currently local access and remote access via WS are provided), a new client implementation has to be provided.
A.4 Manual for Adaptation to Other Domains
Corporate Memory is a generic tool; no customization or reconfiguration is required to adapt it to other domains.