Methods for Web Page Job Offer Extraction

Set of various methods for content extraction from web pages utilizing genetic algorithms, string analysis and expertise.

Institution: Institute of Informatics, Slovak Academy of Sciences
Technologies used: Java 2 Platform Standard Edition, GAJIT (3th party genetic algorithm library), DocConverter
Inputs: collection of original and converted documents
Outputs: collection of webpages encompassing content of job offers.

Tool description

All methods for job offer extraction are bound into one tool located in acquisition chain – ExPoS. The main tool objective is to extract useful information (job offer) from the content of the page pages downloaded and stored in corporate memory by Webcrawler tool which is also part of the acquisition chain in project NAZOU. Extracted information is stored in a webpage with original structure. Afterwards all filtered pages are stored in corporate memory and processed by semantic annotation tool.

Methods description

This experimental tool encompasses the following five methods utilizing different approach how to identify useful information:

Previously mentioned methods are based on the same postulates:

Recursive Substring Analysis (RSA)

Principle:

Note that this method offers exhausting search which is based on strict assumption and is complex. Thus, modifications of this method reducing complexity are essential. One modification is based on reducing of number of processed documents based on binding document into several batches.

Figure 1. Principle of RSA

Due to principle, this algorithm is intended for large patterns. In that case, method will process only higher levels of recursive tree. It does not continue into deeper level. On the contrary, if text string has patterns as small as minimal size of patterns, then algorithm have to process each layer of recursive tree. The maximal number of substrings in recursive tree is

where n is number of words in text, l is minimal length of patterns. Note that, mentioned complexity describes number of teste substings. Consider k as minimal size of pattern, where , than

Note that algorithm performs binary division of the text string into two substrings. This type can lead to complication to identification of pattern, because algorithm can divide substring in the middle of the pattern with high probability. This challenge can be addressed by extension of the substring on both ends (Oravec and Ngyugen 2006). Also, evaluation of frequency using statistical variables is one of the feasible solutions, however this can lead into patterns with useful information, especially in case of large patterns.

Linear Substring Analysis (LSA)

Linear Substring Analysis is based on same assumptions as RSA. However text string is processed rather linearly than recursively. Each i-th processed substring concludes l following words, while first word is i-th word in the text string.

Figure 2. Principle of LSA

The complexity of this methods is . The memory requirements of this method is dramatically reduced against RSA. Disadvantage of this method is overlapped patterns which are produced by the method. Addressing of this issue causes higher complexity of the algorithm. Methods RSA and LSA are competitive. The quality of both methods are comparable, however the question is complexity. From the following complexity comparison

it is obvious that RSA has lower complexity than LSA while a .

Figure 3. Comparison of complexity between LSA and RSA

Common Footer and Header Recognition (CFHR)

This method is based on assumption of document’s vertical structure, i.e. each document from same source have common footer and header. The method has to be based on some statistical method which is robust to small changes in headers and footers. Advantage of this method is its low complexity, however method cannot identify patterns inside of the document. And thus, this method can be used as pre-filtering method for RSA and LSA which reduces information space.

Figure 4. Results of CFHR method.

Genetic Pattern Recognition (GPR)

GPR is searching for pattern utilizing genetic algorithms (Lažanský, 2001 a Sekaj, 2003), where each possible solution is evaluated by fitness function. Each possible solution is represented by chromosome which consists of string of genes, while each gene is one word in document. The order of genes is same as order of words in document. If the gene is part of potential pattern its value is 1 otherwise 0. Fitness function used in genetic algorithm minimizes difference between average frequency of patterns and number of documents and maximizes size of patterns. Genetic algorithm utilizes library GAJIT, which used selection commensurable to quality (Lažanský a Kubalík, 2003). The following figure presents results of method when the first 100 words of document is considered. The quality of the method is 72 %. Genetic search is oriented on points in search space, however utilizing special operators the algorithm can be transformed into searching for regions (Brown a Sumichrast, 2005).

Figure 5. Results of GPR method.

Integration of the tool

Tool is integrated in acquisition chain through the file system and database part of the corporate memory. This integration is independent from the other part of the chain. (Figure 1).

Figure 6. Integration of ExPoS tool in aquistion chain

Input

ExPoS tool is gathers documents from corporate memory, especially from database and file. Documents are already converted into text strings utilizing DocConverter tool.

Output

The output is set of filtered documents stored in corporate memory with appropriate statements in database.

Experimental results

All methods were implemented and tested on downloaded documents from job portal LinkedIn.

Figure 7. Original document containing job offer.

Figure 8. Filtered document containing just job offer.

References

  1. Brown E.C., Sumichrast, R. T. (2005) Evaluating performance advantages of grouping genentic algorithms, Engineering Applications of Artificial Intelligence, Vol. 18., Num. 1., pp. 1-12
  2. Lažanský J. (2001) Evolucní výpocetní techniky, Umela Inteligence (3), Academia, Praha, ISBN 80-200-0472-6
  3. Lažanský J., Kubalík J. (2003) Genetické programování a vybrané problémy evolucních výpoctú, Umela Inteligence (4), Academia, Praha, ISBN 80-200-1044-0
  4. Oravec V., Nguyen G. (2006) Offer Extraction and Separation for Internet Documents, Proceedings in Informatics and Information Technologies (Tools for Acquisition, Organisation and Presenting of Information and Knowledge), Published by Vydavatelstvo STU, ISBN 80-227-2468-8