Research on Chinese Weibo Text Classification Based on Word2Vec
    Abstract:

    Weibo has become an indispensable communication tool in China. Mining information from Weibo text is of great significance to applied research such as automatic question answering and public opinion analysis, and short text classification is the basis of short text mining. The neural network-based Word2Vec model can address the high-dimensional sparseness and semantic gap problems that traditional text classification methods cannot solve. This study obtains word vectors with Word2Vec, then introduces a class factor into the traditional TF-IDF (Term Frequency-Inverse Document Frequency) weighting method to design the word vector weights. Finally, an SVM classifier is used for classification. Experiments on Weibo data verify the effectiveness of the method.
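
The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the class factor formula, so the definition used here (the share of a term's documents that fall in its dominant class) is an assumption for illustration only, not the authors' formula. The toy word vectors, the sample posts, and the function names `idf`, `class_factor`, and `doc_vector` are likewise hypothetical; in practice the vectors would come from a trained Word2Vec model.

```python
import math
from collections import Counter

# Toy 2-dimensional word vectors standing in for trained Word2Vec output
# (made-up values, for illustration only).
WORD_VECTORS = {
    "match": [1.0, 0.0],
    "goal": [0.8, 0.2],
    "stock": [0.0, 1.0],
    "market": [0.2, 0.8],
    "today": [0.5, 0.5],
}

# Tokenized toy posts with class labels (hypothetical data).
DOCS = [
    (["match", "goal", "today"], "sports"),
    (["goal", "match"], "sports"),
    (["stock", "market", "today"], "finance"),
    (["market", "stock"], "finance"),
]


def idf(term, docs):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / (1 + df)) + 1.0


def class_factor(term, docs):
    """Assumed class factor: share of the term's documents in its dominant class."""
    per_class = Counter(label for tokens, label in docs if term in tokens)
    total = sum(per_class.values())
    return max(per_class.values()) / total if total else 0.0


def doc_vector(tokens, docs, vectors):
    """Weighted average of word vectors, with weight = tf * idf * class_factor."""
    counts = Counter(t for t in tokens if t in vectors)
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    weight_sum = 0.0
    for term, count in counts.items():
        tf = count / len(tokens)
        weight = tf * idf(term, docs) * class_factor(term, docs)
        weight_sum += weight
        for i, x in enumerate(vectors[term]):
            acc[i] += weight * x
    # Normalize by the total weight; an all-zero vector is returned if no term matched.
    return [a / weight_sum for a in acc] if weight_sum else acc


vec = doc_vector(["match", "goal", "today"], DOCS, WORD_VECTORS)
print(vec)
```

In this toy setup "today" occurs in both classes, so its assumed class factor is 0.5 and it pulls the document vector less than the class-specific terms do. Document vectors built this way could then be passed to an SVM implementation, for example scikit-learn's `sklearn.svm.SVC`, for the final classification step.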

Get Citation

Niu Xueying, Zhao Enying. Research on Chinese Weibo Text Classification Based on Word2Vec. Computer Systems & Applications, 2019, 28(8):256-261

History
  • Received: February 17, 2019
  • Revised: March 08, 2019
  • Online: August 14, 2019
  • Published: August 15, 2019