THE METHOD OF DATA SEARCH AND ANALYSIS FROM THE INTERNET RESOURCES FOR THE FORMATION OF ACTUAL REQUIREMENTS FOR CANDIDATES
DOI:
https://doi.org/10.20998/2413-3000.2018.1277.5
Keywords:
data mining, parsing, comparative identification, web page, expert, vacancy
Abstract
The article addresses data extraction from Web resources, using the gathering of vacancy information as an example. The process has three main interacting parts: a data source, a database, and an expert. The main problems of the data mining process are the existence of multiple data sources, the representation of data in different languages, the extraction of data from different file formats, and the repeated execution of operations and updating of data. The advantages and disadvantages of the following Web Mining methods are analyzed: DOM tree analysis, line parsing, regular expressions, XML parsing, and the visual approach. DOM tree analysis with XPath is applied in the paper.
The method of comparator identification is proposed for modeling the data extraction process. A component that receives the search topic and the search start page carries out thematically directed extraction. The comparator compares each word extracted from the page with the words of the search model. The approach is applied to identifying a vacancy on a job search site. A thesaurus of employers' requirements is developed, and the indicator words of the required vacancies are presented in three languages.
The parser is configured to process the documents and retrieve the data used to fill a particular data model. The developed module works as follows. It first obtains an array of the necessary pages from the selected Web site. It then analyzes the structure of each Web page. Finally, it retrieves the content of the specific HTML page that contains the necessary information for further extraction and processing. As a result, a "vacancy model" is developed. The model includes the following elements: vacancy title; date the job was added to the site; city where the applicant needs to work; requirements for the candidate; applicant duties; working conditions.
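The vacancy model and the XPath-based extraction step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' module: the page fragment, the class names in the markup, the `Vacancy` dataclass, and the helper functions are all hypothetical, and the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset and requires well-formed markup) stands in for whatever parser the paper used.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real job sites use different markup.
SAMPLE_PAGE = """
<html><body>
  <div class="vacancy">
    <h1 class="title">Java Developer</h1>
    <span class="date">2017-11-02</span>
    <span class="city">Kharkiv</span>
    <ul class="requirements"><li>Java 8</li><li>SQL</li></ul>
    <ul class="duties"><li>Develop web services</li></ul>
    <ul class="conditions"><li>Full-time</li></ul>
  </div>
</body></html>
"""


@dataclass
class Vacancy:
    """The six elements of the vacancy model listed in the abstract."""
    title: str
    date_added: str
    city: str
    requirements: list = field(default_factory=list)
    duties: list = field(default_factory=list)
    conditions: list = field(default_factory=list)


def text_of(node, path):
    """Return the stripped text of the first match, or '' if nothing matches."""
    found = node.find(path)
    return (found.text or "").strip() if found is not None else ""


def items_of(node, path):
    """Return the stripped text of every element matching the path."""
    return [(item.text or "").strip() for item in node.findall(path)]


def parse_vacancy(page):
    """Walk the DOM tree with XPath-style paths and fill the vacancy model."""
    root = ET.fromstring(page.strip())
    block = root.find(".//div[@class='vacancy']")
    if block is None:
        raise ValueError("no vacancy block found on the page")
    return Vacancy(
        title=text_of(block, "./h1[@class='title']"),
        date_added=text_of(block, "./span[@class='date']"),
        city=text_of(block, "./span[@class='city']"),
        requirements=items_of(block, "./ul[@class='requirements']/li"),
        duties=items_of(block, "./ul[@class='duties']/li"),
        conditions=items_of(block, "./ul[@class='conditions']/li"),
    )


if __name__ == "__main__":
    print(parse_vacancy(SAMPLE_PAGE))
```

Real pages are rarely well-formed XML, so a production parser would likely need an HTML-tolerant library; the sketch only shows how paths over the DOM tree map onto the fields of the model.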
Extraction of requirements, duties, and conditions was identified as the most problematic area, because the same information can be presented in different ways. Experts were engaged to unify the requirements.
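The comparison of extracted words with a search model, and the use of a multilingual thesaurus to unify differently worded requirements, can be sketched as below. Everything in the block is an assumption made for illustration: the abstract does not name the three languages (English, Russian, and Ukrainian are used here), and the thesaurus entries, the `comparator` function, and `unify_requirements` are hypothetical stand-ins for the expert-built thesaurus and the paper's comparator model.

```python
import re

# Hypothetical thesaurus: canonical requirement -> indicator words in three
# languages (English, Russian, Ukrainian are assumed for this sketch).
THESAURUS = {
    "experience": {"experience", "опыт", "досвід"},
    "education": {"education", "образование", "освіта"},
    "english": {"english", "английский", "англійська"},
}


def comparator(word, model_words):
    """Return 1 if the extracted word matches a word of the search model, else 0."""
    return 1 if word.casefold() in model_words else 0


def unify_requirements(text):
    """Map free-form requirement text to canonical thesaurus entries."""
    words = re.findall(r"\w+", text)
    found = []
    for canonical, indicators in THESAURUS.items():
        # A canonical requirement is present if any extracted word matches.
        if any(comparator(word, indicators) for word in words):
            found.append(canonical)
    return found


if __name__ == "__main__":
    print(unify_requirements("Досвід роботи від 2 років, English"))
```

Exact word matching is the simplest possible comparator; handling inflected forms or multi-word indicators would need stemming or phrase matching, which this sketch leaves out.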
Copyright (c) 2018 Ольга Юріївна Чередніченко, Марина Анатоліївна Гринченко, Артем Вікторович Василенко, Олександр Миколайович Матвєєв
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.