Chat Content Filtering Based on Doc2Vec and SVM

doi:10.15888/j.cnki.csa.006392

Computer Systems & Applications, 2018, Vol. 27, No. 7, pp. 127-132

Chat Content Filtering Based on Doc2Vec and SVM

Author:

YUE Wen-Ying
School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Abstract:

Real-time interception of user chat content in live-streaming systems is of great practical importance. To improve classification accuracy and efficiency, a text classification model combining Doc2Vec and SVM is proposed to classify chat content and decide whether a message should be intercepted. The model first uses Doc2Vec to represent each chat message as a dense numeric vector and then classifies that vector with an SVM classifier. Experimental results show that the model effectively reduces the dimensionality of the text representation and improves training efficiency, and that it achieves an accuracy of 97% and a recall of 89.82%, outperforming both Naive Bayes and a Doc2Vec-based logistic model.

Key words: text classification; natural language processing (NLP); Doc2Vec model; support vector machine (SVM)

Cite this article

YUE Wen-Ying. Chat Content Filtering Based on Doc2Vec and SVM. Computer Systems & Applications, 2018, 27(7): 127-132

History

Received: 2017-10-16
Last revised: 2017-11-03
Published online: 2018-06-27
