THE METHOD OF DATA SEARCH AND ANALYSIS FROM THE INTERNET RESOURCES FOR THE FORMATION OF ACTUAL REQUIREMENTS FOR CANDIDATES

Authors

DOI:

https://doi.org/10.20998/2413-3000.2018.1277.5

Keywords:

data mining, parsing, comparative identification, web page, expert, vacancy

Abstract

The article addresses data extraction from Web resources, using the collection of vacancy information as an example. The process has three main interacting parts: a data source, a database, and an expert. The main problems of the data mining process are the presence of several data sources; data presented in different languages; data extraction from different file formats; and the repeated updating of recurring operations and data. The advantages and disadvantages of the following Web Mining methods were analyzed: DOM tree analysis, line parsing, regular expressions, XML parsing, and the visual approach. The paper applies DOM tree analysis using XPath. A comparator identification method is proposed for modeling the data extraction process. A component that receives the search topic and the search start page carries out thematically directed extraction. The comparator compares each word extracted from the page with the words of the search model. The approach is applied to identifying a vacancy on a job search site. A thesaurus of employers' requirements was developed. Indicator words for the required vacancies are presented in three languages. The parser was configured; it processes the documents and retrieves the data used to fill a particular data model. The developed module works as follows. It first obtains an array of the necessary pages from the selected Web site. The next step is the analysis of the Web page's structure. Then the content of a specific HTML page containing the necessary information is retrieved for further extraction and processing. As a result, a "vacancy model" is developed. The model should include the following elements: vacancy title; date the job was added to the site; the city where the applicant needs to work; requirements for the candidate; applicant duties; working conditions.
Extraction of requirements, duties, and conditions was identified as the most problematic area, because the same information can be presented in different ways. Experts were engaged to unify the requirements.
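
The steps described in the abstract (XPath extraction from the DOM tree, filling a vacancy model, and comparing extracted words with a search model) can be illustrated with a minimal sketch. It is not the authors' implementation. The sample page, its class names, the indicator words, the choice of three languages, and the names `Vacancy`, `comparator`, `is_relevant`, and `parse_vacancy` are all assumptions made for illustration. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup; real HTML pages would usually need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical, well-formed sample page; real job sites use their own markup.
SAMPLE_PAGE = """
<html><body>
  <div class="vacancy">
    <h1 class="title">Junior Python Developer</h1>
    <span class="date">2017-11-20</span>
    <span class="city">Kharkiv</span>
    <ul class="requirements"><li>Python</li><li>SQL</li></ul>
    <ul class="duties"><li>Write parsers</li></ul>
    <ul class="conditions"><li>Full-time</li></ul>
  </div>
</body></html>
"""

# Hypothetical indicator words in three languages (assumed here to be
# English, Russian, and Ukrainian; the abstract does not name them).
INDICATORS = {"developer", "разработчик", "розробник"}


@dataclass
class Vacancy:
    """The six elements of the vacancy model listed in the abstract."""
    title: str
    date_added: str
    city: str
    requirements: list = field(default_factory=list)
    duties: list = field(default_factory=list)
    conditions: list = field(default_factory=list)


def comparator(word, model_words):
    """Return 1 if the extracted word matches a word of the search model, else 0."""
    return 1 if word.lower() in model_words else 0


def is_relevant(title, model_words):
    """A title is relevant if at least one of its words matches the search model."""
    return any(comparator(w, model_words) for w in title.split())


def parse_vacancy(page):
    """Fill the vacancy model from one page using XPath-style queries."""
    root = ET.fromstring(page.strip())
    node = root.find(".//div[@class='vacancy']")

    def items(css_class):
        # Collect the text of every <li> in the list with the given class.
        return [li.text.strip() for li in node.findall(f"./ul[@class='{css_class}']/li")]

    return Vacancy(
        title=node.findtext("./h1[@class='title']").strip(),
        date_added=node.findtext("./span[@class='date']").strip(),
        city=node.findtext("./span[@class='city']").strip(),
        requirements=items("requirements"),
        duties=items("duties"),
        conditions=items("conditions"),
    )


if __name__ == "__main__":
    vacancy = parse_vacancy(SAMPLE_PAGE)
    if is_relevant(vacancy.title, INDICATORS):
        print(vacancy)
```

In this sketch the comparator is a simple binary word match. The free-text fields (requirements, duties, conditions) are returned as raw strings, since the abstract states that unifying them required expert input.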

Published

2018-02-05

Issue

Section

Collection of scientific articles