Received:July 19, 2019 Revised:August 22, 2019
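Chinese abstract (translated): Topic islands formed by Web pages on the Internet seriously degrade the performance of focused crawlers in search engine systems, and manually adding a large number of initial seed links to discover new topics cannot guarantee comprehensive coverage of topic pages. Based on an analysis of traditional focused crawling strategies that rely on content analysis, link analysis, and context graphs, this paper proposes a focused crawling strategy based on dynamic tunneling. The strategy combines page topic relevance computation with URL link relevance prediction to determine the topic relevance of Web pages between topic islands, and builds a hierarchical topic judgment model to solve the weak-link problem between topic islands. The strategy also effectively prevents the topic drift that occurs when a focused crawler collects too many topic-irrelevant pages, so that the tunnel can be controlled dynamically along crawling directions that preserve the topic's semantic information. In the experiments, the hierarchical structure of topic Web pages is used to detect page topic relevance and to extract keywords for the "sports" topic, which are then used for index query tests on the collected topic pages. The results show that the crawling strategy based on dynamic tunneling handles the topic island problem well and clearly improves the precision and recall of a "sports" topic search engine.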
Abstract: Topic islands among Web pages on the Internet seriously degrade the performance of focused crawlers. Adding more initial seed links to find new topics cannot guarantee comprehensive coverage of topic-relevant Web pages. Based on an analysis of typical crawling strategies, and taking into account the hierarchy of topic relevance, we propose a crawling strategy that uses dynamic tunneling. The strategy applies tunneling based on the topic of Web pages to discover new topics, and constructs a hierarchical topic model to solve the problem of weak links between topic islands. The strategy also effectively prevents the topic drift caused by collecting too many topic-irrelevant pages, and thus dynamically controls the tunneling depth along crawling directions that preserve the semantic information of the topic. Experimental results show that the proposed method can better address the topic island issue, thereby improving the recall of focused search engines.
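The general idea of tunneling can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-overlap `relevance` function, the fixed `max_tunnel` budget, the priority formula, and the toy `pages` and `graph` data are all assumptions made for illustration, and the paper's hierarchical topic model and URL relevance prediction are not reproduced. The sketch only shows how a crawler might pass through a bounded number of consecutive off-topic pages to reach another topic island, and stop expanding once that budget is exhausted to limit topic drift.

```python
import heapq

def relevance(text, keywords):
    """Toy relevance: fraction of topic keywords that occur in the page text."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def crawl(graph, pages, seeds, keywords, threshold=0.5, max_tunnel=2):
    """Best-first crawl over an in-memory link graph with bounded tunneling.

    graph: url -> list of outgoing urls; pages: url -> page text.
    An off-topic page is still expanded while the number of consecutive
    off-topic pages on its path stays within max_tunnel (assumed fixed here).
    """
    # Heap entries are (-priority, url, tunnel_depth_of_parent).
    frontier = [(-1.0, s, 0) for s in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier:
        _, url, depth = heapq.heappop(frontier)
        score = relevance(pages[url], keywords)
        if score >= threshold:
            collected.append(url)
            depth = 0              # back on topic: reset the tunnel
        else:
            depth += 1
            if depth > max_tunnel:
                continue           # budget exhausted: do not expand this page
        for nxt in graph.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                # Links found deeper inside a tunnel get lower priority.
                heapq.heappush(frontier, (-(score / (1 + depth)), nxt, depth))
    return collected

# Hypothetical data: two on-topic pages ("a", "d") joined by two off-topic ones.
pages = {
    "a": "football match results sports",
    "b": "cooking recipes for dinner",
    "c": "travel guide and hotels",
    "d": "basketball sports league football",
}
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
keywords = ["sports", "football"]

print(crawl(graph, pages, ["a"], keywords, max_tunnel=2))  # ['a', 'd']
print(crawl(graph, pages, ["a"], keywords, max_tunnel=1))  # ['a']
```

With a budget of two off-topic hops the crawler reaches the second island at `d`; with a budget of one it stops inside the tunnel and collects only `a`. A dynamic strategy would adjust this budget during the crawl rather than fixing it in advance.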
Funding: National Natural Science Foundation of China (61602374)
Citation:
JIANG Kun, ZHU Lei, WANG Yi-Chuan. Dynamic Tunneling Heuristic for Focused Crawling. COMPUTER SYSTEMS APPLICATIONS, 2020, 29(3): 253-260