Research on Chinese Short Text Classification Based on Word2Vec
Author: Wang Jing, Luo Lang, Wang Deqiang
Funding: CERNET Next Generation Internet Technology Innovation Project (NGII20150106)

Abstract:

To address the inherent feature sparsity of short texts and the "lexical gap" problem of traditional classification models, we apply the Word2Vec model, which maps words into a low-dimensional real-valued vector space according to contextual semantic relations; this effectively alleviates the feature sparsity of short texts and introduces semantic relations that traditional text classification models lack. However, further study shows that using Word2Vec alone ignores the differing influence that words of different parts of speech have on a short text. We therefore introduce part of speech into the feature weighting method: the contribution of each part of speech to text classification is embedded into the traditional TF-IDF algorithm to compute the weights of the words in a short text, and these weights are combined with Word2Vec word vectors to generate the short-text vector. Finally, an SVM is used to classify the short texts. Experimental results on the Fudan University Chinese text classification corpus validate the effectiveness of the proposed method.
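As a rough illustration of the pipeline the abstract describes (POS-weighted TF-IDF combined with Word2Vec word vectors to form short-text vectors, then SVM classification), here is a minimal Python sketch. It is not the authors' implementation: the POS weight table `POS_WEIGHT`, the jieba-based tokenizer, the helper names (`tokenize`, `text_vector`, `train`), and all hyper-parameters are assumptions introduced for demonstration only.

```python
# Minimal sketch of the described pipeline, not the paper's implementation:
# TF-IDF * POS-weighted averaging of Word2Vec vectors, then SVM classification.
import math
from collections import Counter

import numpy as np
import jieba.posseg as pseg          # Chinese word segmentation with POS tags
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Assumed POS contribution weights, keyed by the first letter of the jieba POS tag;
# the paper derives such weights, here they are fixed values for illustration.
POS_WEIGHT = {"n": 1.2, "v": 1.1, "a": 1.0}
DEFAULT_POS_WEIGHT = 0.8


def tokenize(text):
    """Segment one short text into (word, POS tag) pairs."""
    return [(p.word, p.flag) for p in pseg.cut(text)]


def text_vector(tokens, w2v, idf):
    """Weighted average of word vectors; weight = TF * IDF * POS weight."""
    tf = Counter(word for word, _ in tokens)
    vec = np.zeros(w2v.vector_size)
    total = 0.0
    for word, pos in tokens:
        if word not in w2v.wv:
            continue
        w = tf[word] * idf.get(word, 1.0) * POS_WEIGHT.get(pos[:1], DEFAULT_POS_WEIGHT)
        vec += w * w2v.wv[word]
        total += w
    return vec / total if total else vec


def train(texts, labels):
    docs = [tokenize(t) for t in texts]
    sentences = [[word for word, _ in d] for d in docs]
    # Train word vectors on the segmented corpus (illustrative hyper-parameters).
    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
    # Inverse document frequency computed over the training short texts.
    df = Counter(word for s in sentences for word in set(s))
    idf = {word: math.log(len(sentences) / (1 + n)) for word, n in df.items()}
    X = np.vstack([text_vector(d, w2v, idf) for d in docs])
    clf = SVC(kernel="rbf").fit(X, labels)
    return w2v, idf, clf
```

Weighting each word vector by TF-IDF times a POS-dependent factor before averaging keeps content words such as nouns and verbs from being drowned out by function words, which is the motivation the abstract gives for embedding part-of-speech contributions into the TF-IDF weights.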

Cite this article:

Wang J, Luo L, Wang DQ. Research on Chinese short text classification based on Word2Vec. Computer Systems & Applications, 2018, 27(5): 209-215.

History
  • Received: 2017-08-18
  • Last revised: 2017-09-05
  • Published online: 2018-03-12