Computer Systems Applications, 2017, 26(2): 207-211
Illegitimate Website Detection Based on Multi-Dimensional Features
(1. University of Chinese Academy of Sciences, Beijing 100049, China; 2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; 3. National Engineering Laboratory of Internet Domain Name Management Technology, China Internet Network Information Center, Beijing 100190, China)
Received:May 17, 2016    Revised:June 27, 2016
Abstract: Illegitimate websites are currently detected mainly by machine learning methods that use page content such as URLs (Uniform Resource Locators), keywords, and images as features. However, the operators of such sites evade detection by changing URLs, avoiding common illegitimate keywords, and hiding images from search crawlers, so content-based methods miss some of these sites. To detect them more accurately, this paper proposes features drawn from registration and resolution and builds a detection model with mainstream machine learning methods. Applying the model to a new data set shows that detection based on resolution and registration features effectively finds the evasive illegitimate websites described above within a collection of websites, and still identifies ordinary illegitimate websites accurately. This study offers another approach to research on illegitimate website detection.
Keywords: resolution; registration; illegitimate website; detection
Cite as:
TIAN Shuang-Zhu, CHEN Yong, YAN Zhi-Wei, LI Xiao-Dong. Illegitimate Website Detection Based on Multi-Dimensional Features. Computer Systems Applications, 2017, 26(2): 207-211