A              DocConverter – Batch Document Conversion Tool

A.1          Basic Information

The DocConverter tool is a helper tool for batch conversion of documents in different formats (HTML, PDF, DOC, etc.) to plain text. It prepares documents for processing by other tools in the NAZOU project that work with plain text files.

A.1.1      Basic Terms

A.1.2      Method Description

The tool uses legacy document format conversion tools to batch-convert documents in different formats from an input directory structure to plain text. The method is integrated with the NAZOU Corporate Memory (CM): it accesses the documents stored in the CM file storage. In addition, document metadata (creation date, path to the original file, path to the converted file, etc.) is stored in the relational part of the CM, along with the language of the document as identified by the NAZOU Nalit tool. The tool is also integrated with the NAZOU WebCrawler tool, which acquires documents from the Internet and stores their metadata in the relational resource of the CM. DocConverter can operate in conjunction with WebCrawler and batch-convert the downloaded or updated documents by pulling the meta-information produced by WebCrawler from the CM.

A.1.3      Scenarios of Use

The tool is designed for batch conversion of documents to plain text. It can be invoked in two ways:

·        By specifying input and output directories in the CM file storage: the tool converts the documents in the input directory and places their plain text versions in the output directory.

·        In conjunction with WebCrawler: DocConverter converts the new documents acquired by the WebCrawler tool.

A.1.4      External Links and Publications

A.2          Integration Manual

DocConverter is developed in Java (Standard Edition 5) and distributed as a source code archive with an automated build script. Its functionality is accessed through a Java interface. The tool relies on existing external tools to convert the different document formats to plain text.

A.2.1      Dependencies

Libraries:

·        JUnit[1] – Java unit testing library

·        NAZOU Nalit library

·        NAZOU Corporate Memory library

Conversion tools (those tested within the NAZOU project):

·        vilistextum[2] – HTML to plain text converter

·        pdftotext[3] – PDF to plain text converter, part of the Xpdf project

A.2.2      Installation

The environment variable NAZOU_HOME has to be set and point to a valid directory in the host file system. (On Linux-based systems, the variable can be set by executing the command 'export NAZOU_HOME=<path_to_installation_directory>'.)

Unzip the DocConverter distribution archive 'DocConverter.zip'.

DocConverter uses Apache Ant for building and deploying the software. Apache Ant has to be installed and configured on the host system.

Build and deploy DocConverter by running the command 'ant deploy' in the DocConverter distribution directory. The script compiles DocConverter and deploys it to the $NAZOU_HOME directory.

A.2.3      Configuration

The tool configuration file is located at:

 $NAZOU_HOME/conf/DocConverter/DocConverter.properties
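As a minimal sketch of how an integrating application might locate this file, the following Java snippet builds the path from the value of NAZOU_HOME. The class ConfigLocator and its method are illustrative names and not part of DocConverter; the code targets a current JDK rather than Java SE 5.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the DocConverter distribution.
class ConfigLocator {

    /** Builds <nazouHome>/conf/DocConverter/DocConverter.properties. */
    static Path configPath(String nazouHome) {
        if (nazouHome == null || nazouHome.trim().isEmpty()) {
            throw new IllegalStateException("NAZOU_HOME is not set");
        }
        return Paths.get(nazouHome, "conf", "DocConverter", "DocConverter.properties");
    }

    public static void main(String[] args) {
        // Reads the variable from the environment of the running process.
        Path config = configPath(System.getenv("NAZOU_HOME"));
        System.out.println(config + (Files.isRegularFile(config) ? " (found)" : " (missing)"));
    }
}
```

Failing early when the variable is unset gives a clearer error than a later file-not-found message.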

The configuration file contains a list of properties, each mapping a file extension to the conversion utility used for it. The names of these properties start with the character '.':

    .<file_extension>=<conversion_tool>

For example, the conversion of .html files to plain text using the vilistextum utility can be configured as follows:

    .html=vilistextum -a -n

DocConverter invokes the external conversion tools itself; each tool must be executable from the command line and accept the parameters <input_file> <output_file>.

For example, the configuration .html=vilistextum -a -n specifies that HTML documents are converted by invoking the command vilistextum -a -n <input_file> <output_file>.
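The convention above can be illustrated with a short Java sketch that looks up the tool configured for a file extension and appends the two file arguments. CommandBuilderSketch and its methods are hypothetical names; the exact-match extension lookup and the whitespace splitting of the tool string are assumptions, and DocConverter's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Illustrative only; names and lookup rules are assumptions.
class CommandBuilderSketch {

    /** Returns the extension of a file name including the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot);
    }

    /** Builds: <configured tool and options> <input_file> <output_file>. */
    static List<String> buildCommand(Properties config, String inputFile, String outputFile) {
        String tool = config.getProperty(extensionOf(inputFile));
        if (tool == null) {
            throw new IllegalArgumentException("No conversion tool configured for " + inputFile);
        }
        // Naive whitespace split; quoted arguments are not handled in this sketch.
        List<String> command = new ArrayList<>(Arrays.asList(tool.trim().split("\\s+")));
        command.add(inputFile);
        command.add(outputFile);
        return command;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(".html", "vilistextum -a -n");
        // prints [vilistextum, -a, -n, page.html, page.txt]
        System.out.println(buildCommand(config, "page.html", "page.txt"));
    }
}
```

A list of this shape could be handed to java.lang.ProcessBuilder to start the external tool.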

An additional obligatory configuration property, fileStorageResource, specifies the CM file storage resource. The default value is:

    fileStorageResource=nazouStorage
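A configuration of this shape could be read and checked roughly as sketched below: the snippet parses properties text, insists on fileStorageResource, and collects the entries whose names start with '.'. The class ConfigCheckSketch is hypothetical, and the .pdf=pdftotext entry used in the usage note is an illustrative value rather than one given in this manual.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative only; not the validation DocConverter itself performs.
class ConfigCheckSketch {

    /** Parses properties text in the standard java.util.Properties format. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the ".<file_extension>" entries; fails if fileStorageResource is absent. */
    static Map<String, String> converters(Properties props) {
        if (props.getProperty("fileStorageResource") == null) {
            throw new IllegalStateException("Missing obligatory property fileStorageResource");
        }
        Map<String, String> byExtension = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(".")) {
                byExtension.put(key, props.getProperty(key));
            }
        }
        return byExtension;
    }
}
```

For instance, parsing the three lines .html=vilistextum -a -n, .pdf=pdftotext and fileStorageResource=nazouStorage yields a map with two converter entries, while omitting the last line raises an exception.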

A.2.4      Integration Guide

This section describes how to integrate DocConverter with applications using its Java API.

Initialization:

    DocConverter con;

    con = new DocConverter();

Conversion of files in a specific directory:

    con.convertDir("<input_dir_in_CM>", "<output_dir_in_CM>");

Conversion of files acquired by the NAZOU WebCrawler tool:

    con.prepareDocumentsFromWebCrawler();
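The two calls above might be wrapped in application code along the following lines. Because the DocConverter class is not available outside a NAZOU installation, this sketch uses a stand-in interface named Converter that mirrors only the two documented methods; the interface, the class IntegrationSketch, and the rule for choosing between the modes are all assumptions made for illustration.

```java
// Illustrative only; Converter is a stand-in, not the real DocConverter class.
class IntegrationSketch {

    /** Mirrors the two documented calls so the sketch compiles on its own. */
    interface Converter {
        void convertDir(String inputDirInCm, String outputDirInCm);
        void prepareDocumentsFromWebCrawler();
    }

    /**
     * Runs a directory conversion when both directories are given,
     * otherwise converts the documents acquired by WebCrawler.
     * Returns the name of the mode that was used.
     */
    static String run(Converter con, String inputDir, String outputDir) {
        if (inputDir != null && outputDir != null) {
            con.convertDir(inputDir, outputDir);
            return "directory";
        }
        con.prepareDocumentsFromWebCrawler();
        return "webcrawler";
    }
}
```

In a real integration, an adapter that delegates both methods to a DocConverter instance would take the place of the stand-in.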

A.3          Development Manual

This section describes the internal structure of the DocConverter tool and the implementation of its methods.

A.3.1      Tool Structure

The tool consists of a package containing the core functionality classes and a subpackage with JUnit tests.

The package nazou.da.docConversion contains three classes.

Class DocConverter provides the core functionality of the tool and converts documents to plain text (using either batch or single document conversion methods).

Class CMDocs provides the integration layer through which the DocConverter class interacts with the Corporate Memory. The metadata about converted documents is kept in the relational part of the CM. In addition, interaction with the CM is required for integration with the NAZOU WebCrawler tool: the metadata about downloaded documents produced by WebCrawler has to be accessible to DocConverter.

Class NalitCmIntegrator provides integration with NAZOU Nalit, the language identification tool.

A.3.2      Method Implementation

The main functionality of the tool is implemented in the DocConverter class. It provides the following methods:

Add an additional conversion tool:

    addDocTypeConverter(String docType, String converter)

 

Convert all documents in the given directory structure, including subdirectories:

    convertDir(String parInDir, String parOutDir)

 

Convert documents in a single directory:

    convertDocsInDir(String parInDir, String parOutDir)
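The difference between the two directory methods, a whole subdirectory structure versus a single directory, can be sketched with the standard java.nio.file API. This is only an analogy on the local file system: the real methods operate on the CM file storage, and the class DirectoryWalkSketch and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only; uses the local file system instead of the CM file storage.
class DirectoryWalkSketch {

    /** Regular files in dir and all of its subdirectories (the convertDir case). */
    static List<Path> filesInTree(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Regular files directly inside dir (the convertDocsInDir case). */
    static List<Path> filesInDir(Path dir) {
        try (Stream<Path> paths = Files.list(dir)) {
            return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both streams are closed with try-with-resources, which Files.walk and Files.list require in order to release the underlying directory handles.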

 

Convert a single document:

    convertDocument(String doc, String toDir)

The document conversion procedure involves the following steps:

1.      validate the input parameters (existence of the given locations in CM)

2.      get the document type

3.      determine the appropriate conversion tool

4.      run the conversion tool in the runtime environment

5.      determine the language of the document (from the converted plain text)

6.      insert the metadata record about the document into CM
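The six steps above can be sketched in Java as follows. Everything around the step order is an assumption: the collaborator interfaces ToolRunner and MetadataStore, the use of the file extension as the document type, and the output naming (original name plus .txt) are hypothetical stand-ins for the CM, the external tool, and Nalit, not the code of the DocConverter class.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only; collaborators are hypothetical stand-ins for CM, the tool, and Nalit.
class ConversionStepsSketch {

    interface ToolRunner {
        /** Runs the tool and returns the produced plain text. */
        String run(String tool, String inputFile, String outputFile);
    }

    interface MetadataStore {
        void insert(String originalPath, String convertedPath, String language);
    }

    static String convert(String doc, String toDir,
                          Predicate<String> existsInCm,
                          Map<String, String> toolsByExtension,
                          ToolRunner runner,
                          Function<String, String> detectLanguage,
                          MetadataStore metadata) {
        // 1. Validate the input parameters.
        if (!existsInCm.test(doc) || !existsInCm.test(toDir)) {
            throw new IllegalArgumentException("Location not found in CM: " + doc + " or " + toDir);
        }
        // 2. Get the document type (assumed here to be the file extension).
        String name = doc.substring(doc.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String type = dot < 0 ? "" : name.substring(dot);
        // 3. Determine the appropriate conversion tool.
        String tool = toolsByExtension.get(type);
        if (tool == null) {
            throw new IllegalStateException("No conversion tool for type '" + type + "'");
        }
        // 4. Run the conversion tool (output naming is an assumption).
        String outFile = toDir + "/" + name + ".txt";
        String text = runner.run(tool, doc, outFile);
        // 5. Determine the language from the converted plain text.
        String language = detectLanguage.apply(text);
        // 6. Insert the metadata record.
        metadata.insert(doc, outFile, language);
        return outFile;
    }
}
```

Passing the collaborators in as parameters keeps the sketch runnable without a CM installation and makes each step replaceable in tests.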

 

Convert documents downloaded by the NAZOU WebCrawler tool:

    prepareDocumentsFromWebCrawler()

A.4          Manual for Adaptation to Other Domains

DocConverter is a generic tool; no customization or reconfiguration is required to adapt it to other domains.

 



[1] http://www.junit.org/

[2] http://bhaak.dyndns.org/vilistextum

[3] http://www.foolabs.com/xpdf/download.html

[4] http://www.winfield.demon.nl/