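Malicious URL Detection Based on Semi-Supervised Learning
Funding:

Natural Science Foundation of Zhejiang Province (LY19F020021); Zhejiang Provincial College Students' Science and Technology Innovation Activity Program (Xinmiao Talent Program) (2019R426035)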
    Abstract:

    Detecting malicious URLs is important for defending against cyber attacks. Because supervised learning requires a large number of labeled samples, this study trains malicious URL detection models with semi-supervised learning, which reduces the cost of labeling data. We propose an improved algorithm based on traditional co-training. Two classifiers are trained on data preprocessed in two different ways, one using expert knowledge and the other using Doc2Vec. Samples for which the two classifiers give the same prediction with high confidence are pseudo-labeled and then used to continue training the classifiers. Experimental results show that, with only 0.67% of the data labeled, the proposed method trains two classifiers of different types whose detection precision reaches 99.42% and 95.23%, respectively. This is close to the performance of supervised learning and better than that of self-training and co-training.
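The pseudo-label selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_pseudo_labels`, the 0.95 confidence threshold, and the toy probabilities are all assumptions made for illustration, since the abstract does not say how "high confidence" is defined. The sketch assumes each classifier outputs per-class probabilities for the same unlabeled URLs.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    proba_a: Sequence[Sequence[float]],
    proba_b: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed value; not given in the abstract
) -> List[Tuple[int, int]]:
    """Return (sample_index, label) pairs on which both classifiers agree
    and both are at least `threshold` confident."""
    if len(proba_a) != len(proba_b):
        raise ValueError("both classifiers must score the same samples")

    selected = []
    for i, (pa, pb) in enumerate(zip(proba_a, proba_b)):
        # Predicted class = index of the largest class probability.
        label_a = max(range(len(pa)), key=pa.__getitem__)
        label_b = max(range(len(pb)), key=pb.__getitem__)
        # Keep the sample only if the predictions match and both are confident.
        if label_a == label_b and min(pa[label_a], pb[label_b]) >= threshold:
            selected.append((i, label_a))
    return selected


# Hypothetical probabilities [P(benign), P(malicious)] for four unlabeled URLs,
# one list per view (expert-knowledge features vs. Doc2Vec features).
view_expert = [[0.99, 0.01], [0.03, 0.97], [0.60, 0.40], [0.98, 0.02]]
view_doc2vec = [[0.97, 0.03], [0.02, 0.98], [0.55, 0.45], [0.10, 0.90]]

pseudo = select_pseudo_labels(view_expert, view_doc2vec, threshold=0.95)
print(pseudo)  # [(0, 0), (1, 1)]
```

In this toy run the third URL is dropped because neither classifier is confident, and the fourth is dropped because the two views disagree. In a full training loop, the selected samples would be added to the labeled set and both classifiers retrained before the next round.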

Cite this article

麻瓯勃, 刘雪娇, 唐旭栋, 周宇轩, 胡亦承. Malicious URL Detection Based on Semi-Supervised Learning. Computer Systems & Applications (计算机系统应用), 2020, 29(11): 11-20

History
  • Received: 2019-11-18
  • Last revised: 2019-12-11
  • Published online: 2020-10-30