Malicious URL Detection Based on Semi-Supervised Learning

doi:10.15888/j.cnki.csa.007461


Authors: MA Ou-Bo (麻瓯勃), LIU Xue-Jiao (刘雪娇), TANG Xu-Dong (唐旭栋), ZHOU Yu-Xuan (周宇轩), HU Yi-Cheng (胡亦承)

Affiliation: Hangzhou Institute of Service Engineering, Hangzhou Normal University, Hangzhou 311121, China

Fund Project: Natural Science Foundation of Zhejiang Province (LY19F020021); Zhejiang Provincial University Students' Science and Technology Innovation Program (Xinmiao Talent Program) (2019R426035)


Abstract:

Detecting malicious URLs is important for defending against cyber attacks. Because supervised learning requires a large number of labeled samples, this study trains malicious URL detection models with semi-supervised learning, which reduces the cost of labeling data. We propose an improved algorithm based on traditional co-training: two classifiers are trained on data preprocessed by two methods, expert knowledge and Doc2Vec, and samples for which both classifiers give the same prediction with high confidence are pseudo-labeled and fed back to the classifiers for further learning. Experimental results show that, using only 0.67% labeled data, the proposed method trains two classifiers of different types with detection precision of 99.42% and 95.23%, respectively, which is close to supervised learning performance and better than self-training and co-training.

Key words: malicious URL detection; semi-supervised learning; improved co-training algorithm; Doc2Vec; classifier training
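
The agreement-based pseudo-labeling loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the two "views" are synthetic numeric features standing in for the expert-knowledge and Doc2Vec representations, `CentroidClassifier` is a toy nearest-centroid model standing in for the paper's classifiers, and the names `agreement_co_training`, `make_toy_views`, `threshold`, and `max_rounds` are hypothetical. Labeling 4 of 600 samples only loosely mirrors the roughly 0.67% proportion mentioned above.

```python
import math
import random


class CentroidClassifier:
    """Toy nearest-centroid model (an assumption, not the paper's classifier)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        dists = {c: math.dist(x, mu) for c, mu in self.centroids.items()}
        nearest = min(dists, key=dists.get)
        # Softmax over negative distances, shifted so the best score is exp(0) = 1.
        total = sum(math.exp(dists[nearest] - d) for d in dists.values())
        return nearest, 1.0 / total


def agreement_co_training(view_a, view_b, labels, threshold=0.9, max_rounds=10):
    """labels[i] is a class for labeled samples and None for unlabeled ones."""
    labels = list(labels)
    clf_a, clf_b = CentroidClassifier(), CentroidClassifier()

    def refit():
        known = [i for i, t in enumerate(labels) if t is not None]
        y = [labels[i] for i in known]
        clf_a.fit([view_a[i] for i in known], y)
        clf_b.fit([view_b[i] for i in known], y)

    for _ in range(max_rounds):
        refit()
        accepted = {}
        for i, t in enumerate(labels):
            if t is not None:
                continue
            pred_a, conf_a = clf_a.predict_with_confidence(view_a[i])
            pred_b, conf_b = clf_b.predict_with_confidence(view_b[i])
            # Pseudo-label only when both classifiers agree and both are confident.
            if pred_a == pred_b and min(conf_a, conf_b) >= threshold:
                accepted[i] = pred_a
        if not accepted:
            break
        for i, pred in accepted.items():
            labels[i] = pred
    refit()  # final fit so both models see every accepted pseudo-label
    return clf_a, clf_b, labels


def make_toy_views(n_per_class=300, seed=0):
    """Synthetic two-view data; a stand-in for real URL features."""
    rng = random.Random(seed)
    view_a, view_b, truth = [], [], []
    for label, center in ((0, 0.0), (1, 6.0)):
        for _ in range(n_per_class):
            view_a.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            view_b.append([rng.gauss(center, 1.0)])
            truth.append(label)
    return view_a, view_b, truth


view_a, view_b, truth = make_toy_views()
# Keep labels for only two samples per class (4 of 600); hide the rest.
labeled_idx = {0, 1, 300, 301}
labels = [t if i in labeled_idx else None for i, t in enumerate(truth)]
clf_a, clf_b, final_labels = agreement_co_training(view_a, view_b, labels)

pseudo = [i for i in range(len(truth))
          if i not in labeled_idx and final_labels[i] is not None]
accuracy = sum(final_labels[i] == truth[i] for i in pseudo) / max(len(pseudo), 1)
print(f"pseudo-labeled {len(pseudo)} samples, match with ground truth: {accuracy:.3f}")
```

Requiring agreement between the two classifiers in addition to a confidence threshold is what distinguishes this sketch from plain self-training, where a single model's confident mistakes are fed straight back into its own training set. Samples on which the models disagree or hesitate simply stay unlabeled.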


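Cite this article

MA Ou-Bo, LIU Xue-Jiao, TANG Xu-Dong, ZHOU Yu-Xuan, HU Yi-Cheng. Malicious URL Detection Based on Semi-Supervised Learning. Computer Systems & Applications (计算机系统应用), 2020, 29(11): 11-20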


History

Received: 2019-11-18
Last revised: 2019-12-11
Published online: 2020-10-30
