Received: May 17, 2016; Revised: June 27, 2016
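Chinese abstract: At present, illegitimate websites are detected mainly by machine learning methods that use web page content such as the URL (Uniform Resource Locator), keywords, and images as features. However, the creators of such websites evade detection by changing URLs, avoiding common illegitimate keywords, and hiding images from search crawlers, so content-based detection methods miss some of them. To detect such websites more accurately, this paper proposes features related to registration and resolution, and builds detection models with the most widely used machine learning methods. Applying the models to a new dataset shows that the detection method based on resolution and registration features effectively identifies the evasive illegitimate websites described above within a collection of websites, and still accurately identifies ordinary illegitimate websites. This study offers another approach to research on illegitimate website detection.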
Abstract: The Web Information Extraction and Knowledge Presentation System is proposed to extract information from data-intensive web pages. It downloads dynamic web pages based on a knowledge database, converts them to XML documents after preprocessing, finds repeated patterns in them using a PAT-array based pattern discovery algorithm, automatically recognizes their data display structure models from the repeated patterns and an ontology-based keyword library, and then extracts the data and stores them in the knowledge database using the object-relational mapping technology of XML. Through these steps, web data is extracted automatically and the knowledge database is expanded automatically. Experiments on the traffic information auto-extraction and mixed traffic travel schemes auto-creation system showed that the system has high precision and adapts to web pages in different domains with different structures.
Keywords: analysis; registration; illegitimate website detection
Citation:
田双柱,陈勇,延志伟,李晓东.基于多维度特征的不良网站检测.计算机系统应用,2017,26(2):207-211
TIAN Shuang-Zhu, CHEN Yong, YAN Zhi-Wei, LI Xiao-Dong. Illegitimate Website Detection Based on Multi-Dimensional Features. Computer Systems Applications, 2017, 26(2): 207-211