Methods for Web Page Job Offer Extraction
A set of methods for content extraction from web pages, utilizing genetic algorithms, string analysis and domain expertise.
Institution: | Institute of Informatics, Slovak Academy of Sciences |
Technologies used: | Java 2 Platform Standard Edition, GAJIT (third-party genetic algorithm library), DocConverter |
Inputs: | collection of original and converted documents |
Outputs: | collection of web pages containing only job offer content |
Tool description
All methods for job offer extraction are bound into one tool located in the acquisition chain, ExPoS. The main objective of the tool is to extract useful information (job offers) from the content of pages downloaded and stored in the corporate memory by the Webcrawler tool, which is also part of the acquisition chain in the NAZOU project. Extracted information is stored in a web page with the original structure. Afterwards, all filtered pages are stored in the corporate memory and processed by a semantic annotation tool.
Methods description
This experimental tool encompasses the following four methods, each utilizing a different approach to identify useful information:
- RSA (Recursive Substring Analysis) is a method based on recursive division of plain text into substrings and counting the frequency of their occurrence in a set of pages with the same structure. Patterns with high frequency are removed from the page as undesirable content. This method is systematic but has high complexity.
- LSA (Linear Substring Analysis) is a modification of the RSA method which processes plain text in a linear rather than recursive manner. This modification dramatically reduces the complexity of the method.
- CHFR (Common Header and Footer Recognition) is based on the assumption that documents are structured vertically; thus each document has a header and a footer which are common to every document from the same source.
- GPR (Genetic Pattern Recognition) is based on a genetic algorithm which searches the content of the documents for common patterns shared by all documents. The fitness function minimizes the difference between the average frequency of the found patterns and the number of documents, and maximizes the size of the patterns.
All methods rely on two assumptions:
- documents from the same source have the same structure;
- offers included in the documents are different.
Recursive Substring Analysis (RSA)
Principle:
- Documents containing offers are converted into text documents; thus each document is represented by one single text string.
- Each text string is decomposed into all possible substrings, while the order of the words remains.
- The frequency of each substring is computed over the documents from the same source.
- The largest substring whose frequency equals the number of processed documents is labelled as a pattern carrying information which will be removed from the document; thus this pattern cannot be part of an offer.
Note that this method performs an exhaustive search which is based on a strict assumption and is computationally complex. Thus, modifications of this method that reduce its complexity are essential. One such modification reduces the number of processed documents by grouping documents into several batches.

Figure 1. Principle of RSA
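The division-and-counting procedure above can be sketched as follows. This is a minimal illustration, not the original ExPoS implementation; the class and method names (`RsaSketch`, `findPatterns`) are made up for the example, and substring containment is checked with a plain string search, ignoring word boundaries for brevity.

```java
import java.util.*;

// Illustrative sketch of Recursive Substring Analysis (RSA).
public class RsaSketch {

    // Returns word sequences (joined by spaces) that occur in every document
    // and are at least minLen words long; such patterns are treated as
    // undesirable boilerplate to be removed from the pages.
    static Set<String> findPatterns(List<String[]> docs, int minLen) {
        Set<String> patterns = new HashSet<>();
        String[] first = docs.get(0);          // reference document
        divide(first, 0, first.length, docs, minLen, patterns);
        return patterns;
    }

    // Binary division of the word range [from, to) of the reference document.
    static void divide(String[] words, int from, int to,
                       List<String[]> docs, int minLen, Set<String> patterns) {
        int len = to - from;
        if (len < minLen) return;              // shorter than the minimal pattern size
        String candidate = String.join(" ", Arrays.copyOfRange(words, from, to));
        if (occursInAll(candidate, docs)) {    // frequency equals the number of documents
            patterns.add(candidate);           // large pattern found; no need to recurse
            return;
        }
        int mid = from + len / 2;              // split in the middle and recurse
        divide(words, from, mid, docs, minLen, patterns);
        divide(words, mid, to, docs, minLen, patterns);
    }

    static boolean occursInAll(String candidate, List<String[]> docs) {
        for (String[] doc : docs) {
            if (!String.join(" ", doc).contains(candidate)) return false;
        }
        return true;
    }
}
```

Note that, as discussed below, the blind split in the middle can cut a pattern in half; the extension of substrings on both ends is left out of this sketch.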
Due to its principle, this algorithm is intended for large patterns: in that case the method processes only the higher levels of the recursive tree and does not descend into deeper levels. On the contrary, if the text string contains patterns as small as the minimal pattern size, the algorithm has to process each layer of the recursive tree. The maximal number of substrings in the recursive tree is

N(n, l) = 2 * (n / l) - 1,

where n is the number of words in the text and l is the minimal length of a pattern; each node of the binary recursion tree contributes one tested substring, and the tree has at most n/l leaves. Note that the mentioned complexity describes the number of tested substrings. Considering k as the minimal size of a pattern, where l <= k <= n, the number of tested substrings reduces to

N(n, k) = 2 * (n / k) - 1.
Note that the algorithm performs binary division of the text string into two substrings. This can complicate the identification of patterns, because with high probability the algorithm divides the substring in the middle of a pattern. This challenge can be addressed by extending the substring on both ends (Oravec and Nguyen, 2006). Evaluating frequency using statistical variables is also a feasible solution; however, this can lead to patterns containing useful information, especially in the case of large patterns.
Linear Substring Analysis (LSA)
Linear Substring Analysis is based on the same assumptions as RSA; however, the text string is processed linearly rather than recursively. The i-th processed substring contains l consecutive words, the first of which is the i-th word of the text string.

Figure 2. Principle of LSA
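The sliding-window idea can be sketched as follows, under the same illustrative naming assumptions as before; note how the result naturally contains overlapping windows, which is the disadvantage of the method.

```java
import java.util.*;

// Illustrative sketch of Linear Substring Analysis (LSA).
public class LsaSketch {

    // Slides a window of exactly l words over the reference document and
    // keeps every window that also occurs in all other documents.
    // There are n - l + 1 windows, hence the linear complexity; overlapping
    // patterns are reported as-is (merging them is left out for brevity).
    static List<String> findPatterns(List<String[]> docs, int l) {
        List<String> patterns = new ArrayList<>();
        String[] ref = docs.get(0);
        for (int i = 0; i + l <= ref.length; i++) {
            String window = String.join(" ", Arrays.copyOfRange(ref, i, i + l));
            boolean inAll = true;
            for (String[] doc : docs) {
                if (!String.join(" ", doc).contains(window)) { inAll = false; break; }
            }
            if (inAll) patterns.add(window);
        }
        return patterns;
    }
}
```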
The complexity of this method is n - l + 1 tested substrings, i.e., linear in the number of words n. The memory requirements of this method are dramatically reduced compared to RSA. A disadvantage of this method is the overlapping patterns it produces; addressing this issue increases the complexity of the algorithm. Methods RSA and LSA are competitive: the quality of both methods is comparable, so the deciding question is complexity. From the complexity comparison

2 * (n / l) - 1 < n - l + 1

it is obvious that RSA has lower complexity than LSA whenever l > 2 (for sufficiently large n).

Figure 3. Comparison of complexity between LSA and RSA
Common Header and Footer Recognition (CHFR)
This method is based on the assumption of a vertical document structure, i.e., all documents from the same source have a common header and footer. The method has to be based on a statistical method which is robust to small changes in headers and footers. The advantage of this method is its low complexity; however, it cannot identify patterns inside the document. Thus, this method can be used as a pre-filtering step for RSA and LSA which reduces the information space.

Figure 4. Results of CHFR method.
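A strict variant of the header/footer comparison can be sketched as follows; the names are illustrative, and a production version would use a statistical, fuzzy comparison to tolerate small changes (e.g. dates) rather than exact line equality.

```java
import java.util.*;

// Illustrative sketch of Common Header and Footer Recognition (CHFR).
public class ChfrSketch {

    // Returns the body of the first document with the common header
    // (longest shared prefix of lines) and common footer (longest shared
    // suffix of lines) stripped away.
    static List<String> stripHeaderFooter(List<List<String>> docs) {
        int minLen = docs.stream().mapToInt(List::size).min().orElse(0);
        int head = 0;
        while (head < minLen && sameLineAt(docs, head, false)) head++;
        int foot = 0;
        while (foot < minLen - head && sameLineAt(docs, foot, true)) foot++;
        List<String> first = docs.get(0);
        return new ArrayList<>(first.subList(head, first.size() - foot));
    }

    // Checks whether the line at the given offset (from top or bottom)
    // is identical in all documents.
    static boolean sameLineAt(List<List<String>> docs, int offset, boolean fromEnd) {
        String ref = lineAt(docs.get(0), offset, fromEnd);
        for (List<String> doc : docs) {
            if (!ref.equals(lineAt(doc, offset, fromEnd))) return false;
        }
        return true;
    }

    static String lineAt(List<String> doc, int offset, boolean fromEnd) {
        return fromEnd ? doc.get(doc.size() - 1 - offset) : doc.get(offset);
    }
}
```

Because the scan never looks past the first differing line, the method is linear in document length, matching the low complexity claimed above.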
Genetic Pattern Recognition (GPR)
GPR searches for patterns utilizing genetic algorithms (Lažanský, 2001 and Sekaj, 2003), where each possible solution is evaluated by a fitness function. Each possible solution is represented by a chromosome consisting of a string of genes, where each gene corresponds to one word in the document. The order of the genes is the same as the order of the words in the document. If a gene is part of a potential pattern, its value is 1; otherwise it is 0. The fitness function used in the genetic algorithm minimizes the difference between the average frequency of the patterns and the number of documents, and maximizes the size of the patterns. The genetic algorithm utilizes the GAJIT library, which uses selection proportional to quality (Lažanský and Kubalík, 2003). The following figure presents the results of the method when the first 100 words of a document are considered. The quality of the method is 72 %. Genetic search is oriented on points in the search space; however, utilizing special operators, the algorithm can be transformed into a search for regions (Brown and Sumichrast, 2005).

Figure 5. Results of GPR method.
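The chromosome encoding and fitness evaluation described above can be sketched as follows. This is an illustrative reimplementation, not the GAJIT-based ExPoS code; the exact formula combining the two objectives (frequency match and pattern size) is an assumption.

```java
import java.util.*;

// Illustrative sketch of the GPR chromosome decoding and fitness function.
public class GprFitnessSketch {

    // A chromosome has one bit per word of the reference document;
    // maximal runs of 1-bits encode the candidate patterns.
    static List<String> decodePatterns(boolean[] chromosome, String[] words) {
        List<String> patterns = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= chromosome.length; i++) {
            boolean on = i < chromosome.length && chromosome[i];
            if (on && start < 0) start = i;           // run of 1s begins
            if (!on && start >= 0) {                  // run of 1s ends
                patterns.add(String.join(" ", Arrays.copyOfRange(words, start, i)));
                start = -1;
            }
        }
        return patterns;
    }

    // Higher is better: rewards patterns whose average frequency across the
    // documents matches the number of documents (minimized difference term)
    // and rewards larger patterns (maximized size term).
    static double fitness(boolean[] chromosome, String[] words, List<String> docs) {
        List<String> patterns = decodePatterns(chromosome, words);
        if (patterns.isEmpty()) return 0.0;
        double totalFreq = 0, totalSize = 0;
        for (String p : patterns) {
            int freq = 0;
            for (String doc : docs) if (doc.contains(p)) freq++;
            totalFreq += freq;
            totalSize += p.split(" ").length;
        }
        double freqError = Math.abs(totalFreq / patterns.size() - docs.size());
        double avgSize = totalSize / patterns.size();
        return avgSize / (1.0 + freqError);
    }
}
```

A chromosome marking a genuinely common region (header words shared by all documents) thus scores higher than one marking offer-specific words, which is what drives the evolution toward the boilerplate patterns.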
Integration of the tool
The tool is integrated into the acquisition chain through the file system and the database part of the corporate memory. This integration is independent of the other parts of the chain (Figure 6).

Figure 6. Integration of the ExPoS tool in the acquisition chain
Input
The ExPoS tool gathers documents from the corporate memory, specifically from the database and the file system. The documents have already been converted into text strings by the DocConverter tool.
Output
The output is a set of filtered documents stored in the corporate memory, together with the appropriate records in the database.
Experimental results
All methods were implemented and tested on documents downloaded from the job portal LinkedIn.

Figure 7. Original document containing job offer.

Figure 8. Filtered document containing just job offer.
References
- Brown E. C., Sumichrast R. T. (2005) Evaluating performance advantages of grouping genetic algorithms, Engineering Applications of Artificial Intelligence, Vol. 18, No. 1, pp. 1-12
- Lažanský J. (2001) Evoluční výpočetní techniky, Umělá inteligence (3), Academia, Praha, ISBN 80-200-0472-6
- Lažanský J., Kubalík J. (2003) Genetické programování a vybrané problémy evolučních výpočtů, Umělá inteligence (4), Academia, Praha, ISBN 80-200-1044-0
- Oravec V., Nguyen G. (2006) Offer Extraction and Separation for Internet Documents, Proceedings in Informatics and Information Technologies (Tools for Acquisition, Organisation and Presenting of Information and Knowledge), Published by Vydavatelstvo STU, ISBN 80-227-2468-8