Received: May 09, 2018; Revised: June 04, 2018
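Abstract (translated from the Chinese): Building vertical search engines is one of the hot topics in the search field, with a wide range of applications. Existing methods generally optimize only one or a few stages of vertical search engine construction, and acquiring information from different websites often requires manual configuration, which is cumbersome. Building on an in-depth study of the techniques for constructing vertical search engines, this paper uses Java open-source tools such as Heritrix and Solr, combined with webpage text extraction and full-word extraction algorithms, to propose a method for automatically constructing a vertical search engine. It studies the key problems at each stage of implementing the method and gives corresponding optimization schemes. Practice shows that the proposed method and optimization schemes are highly practical.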
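Keywords (translated from the Chinese): vertical search engine; information crawling; webpage text extraction; full word extraction; Heritrix and Solr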
Abstract: Vertical search engine construction has long been a hot topic in search technology, with a wide range of applications. Mainstream methods, however, still have several flaws. In many cases they optimize only one or a few stages of the construction process, and obtaining information from different websites usually requires manual configuration, which is cumbersome. Based on an in-depth study of vertical search engine technology, this article presents a method for automatically constructing a vertical search engine that uses Java open-source tools such as Heritrix and Solr, combined with algorithms for webpage text extraction and full-word extraction. The article also examines the key issues at each stage of the method's implementation and proposes corresponding optimizations. Practice shows that the proposed method and optimizations are highly practical.
Keywords: vertical search engine; information crawling; webpage text extraction; full word extraction; Heritrix and Solr
Article ID: CLC number: Document code:
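Funding: Fund of the State Key Laboratory of Geo-Information Engineering (SKLGIE2017-M-4-6); Young Scientists Fund of the National Natural Science Foundation of China (41701537); College Students' Innovation Project (201810489071)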
Citation:
WANG Du (王督), CAI Yong-Xiang (蔡永香), LI Bo-Han (李博涵), LIU Yuan-Gang (刘远刚). Critical Problems and Solutions for Vertical Search Engine in Oil and Gas Industry. Computer Systems Applications, 2018, 27(12): 18-24