Short Text Topic Model Based on Semantic Enhancement
Authors: 高娟, 张晓滨
Funding: Natural Science Foundation of Shaanxi Province (2019JQ-849); Keqiao Textile Industry Innovation Project (19KQYB23)
    Abstract:

    Traditional topic models rely heavily on word co-occurrence patterns to infer document topics. Short texts provide too little context, and the resulting data sparsity is the main bottleneck that keeps traditional topic models from performing well on them. To address this, this study proposes a short text topic model based on semantic enhancement. The algorithm combines the Dirichlet Multinomial Mixture (DMM) model with a word embedding model: it trains global and local word embeddings to obtain vector representations of words, fuses the global and local embedding vectors to compute the semantic relatedness between word vectors using cosine similarity, and applies semantic enhancement to words through the weights of topic-related words. Experiments show that the proposed model is more accurate in terms of topic coherence and improves classification accuracy on short texts.
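
The relatedness step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the global and local vectors are fused, so the concatenation of L2-normalized vectors used here is an assumption, and the function names (`fuse`, `relatedness`) and the toy vectors are hypothetical, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(global_vec, local_vec):
    """Assumed fusion rule: concatenate the normalized global and local vectors."""
    return l2_normalize(global_vec) + l2_normalize(local_vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(word_a, word_b, global_emb, local_emb):
    """Semantic relatedness of two words from their fused embeddings."""
    fused_a = fuse(global_emb[word_a], local_emb[word_a])
    fused_b = fuse(global_emb[word_b], local_emb[word_b])
    return cosine(fused_a, fused_b)


# Toy embeddings (made up for illustration): "global" would come from a large
# external corpus, "local" from the short-text collection itself.
global_emb = {"apple": [1.0, 0.0], "pear": [1.0, 0.0], "car": [0.0, 1.0]}
local_emb = {"apple": [0.0, 2.0], "pear": [0.0, 1.0], "car": [3.0, 0.0]}

score_similar = relatedness("apple", "pear", global_emb, local_emb)
score_different = relatedness("apple", "car", global_emb, local_emb)
```

With this particular fusion rule, the fused cosine equals the mean of the global cosine and the local cosine, so neither embedding source dominates. A weighted combination would be an equally plausible choice; the abstract alone does not settle which one the authors use.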

Cite this article

高娟, 张晓滨. Short text topic model based on semantic enhancement. Computer Systems & Applications (计算机系统应用), 2021, 30(6): 141-147

History
  • Received: 2020-10-05
  • Last revised: 2020-11-02
  • Published online: 2021-06-05