Research on Chinese Short Text Classification Based on Word2Vec
Author: Wang Jing, Luo Lang, Wang Deqiang
Funding: CERNET Next Generation Internet Technology Innovation Project (NGII20150106)

Abstract:

To address the inherent feature sparsity of short texts and the "lexical gap" problem of traditional classification models, we apply the Word2Vec model, which maps words into a low-dimensional real-valued vector space according to contextual semantic relations; this effectively alleviates the feature sparsity of short texts and introduces semantic relations that traditional text classification models lack. However, further study shows that using Word2Vec alone ignores the differing influence that words of different parts of speech have on a short text. We therefore introduce part of speech into the feature weighting method: the contribution of each part of speech to text classification is embedded into the traditional TF-IDF algorithm to compute the weights of the words in a short text, and these weights are combined with Word2Vec word vectors to generate the short-text vector. Finally, an SVM is used to classify the short texts. Experimental results on the Fudan University Chinese text classification corpus validate the effectiveness of the proposed method.
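As a rough illustration of the pipeline the abstract describes (POS-weighted TF-IDF combined with Word2Vec word vectors to form short-text vectors, then SVM classification), here is a minimal Python sketch. It is not the authors' implementation: the POS weight table `POS_WEIGHT`, the jieba-based tokenizer, the helper names (`tokenize`, `text_vector`, `train`), and all hyper-parameters are assumptions introduced for demonstration only.

```python
# Minimal sketch of the described pipeline, not the paper's implementation:
# TF-IDF * POS-weighted averaging of Word2Vec vectors, then SVM classification.
import math
from collections import Counter

import numpy as np
import jieba.posseg as pseg          # Chinese word segmentation with POS tags
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Assumed POS contribution weights, keyed by the first letter of the jieba POS tag;
# the paper derives such weights, here they are fixed values for illustration.
POS_WEIGHT = {"n": 1.2, "v": 1.1, "a": 1.0}
DEFAULT_POS_WEIGHT = 0.8


def tokenize(text):
    """Segment one short text into (word, POS tag) pairs."""
    return [(p.word, p.flag) for p in pseg.cut(text)]


def text_vector(tokens, w2v, idf):
    """Weighted average of word vectors; weight = TF * IDF * POS weight."""
    tf = Counter(word for word, _ in tokens)
    vec = np.zeros(w2v.vector_size)
    total = 0.0
    for word, pos in tokens:
        if word not in w2v.wv:
            continue
        w = tf[word] * idf.get(word, 1.0) * POS_WEIGHT.get(pos[:1], DEFAULT_POS_WEIGHT)
        vec += w * w2v.wv[word]
        total += w
    return vec / total if total else vec


def train(texts, labels):
    docs = [tokenize(t) for t in texts]
    sentences = [[word for word, _ in d] for d in docs]
    # Train word vectors on the segmented corpus (illustrative hyper-parameters).
    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
    # Inverse document frequency computed over the training short texts.
    df = Counter(word for s in sentences for word in set(s))
    idf = {word: math.log(len(sentences) / (1 + n)) for word, n in df.items()}
    X = np.vstack([text_vector(d, w2v, idf) for d in docs])
    clf = SVC(kernel="rbf").fit(X, labels)
    return w2v, idf, clf
```

Weighting each word vector by TF-IDF times a POS-dependent factor before averaging keeps content words such as nouns and verbs from being drowned out by function words, which is the motivation the abstract gives for embedding part-of-speech contributions into the TF-IDF weights.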

Cite this article:

Wang J, Luo L, Wang DQ. Research on Chinese short text classification based on Word2Vec. Computer Systems & Applications, 2018, 27(5): 209-215.

History
  • Received: 2017-08-18
  • Last revised: 2017-09-05
  • Published online: 2018-03-12