Text Classification Model Based on Width and Word Vector Feature
Abstract:

Word-vector-based text classification models have weak memorization ability and lack global word feature information. To address these problems, we propose WideText, a text classification model based on width features and word vector features. First, the text is cleaned and segmented into words, the tokens are encoded, and a dictionary is defined. Second, the term frequency-inverse document frequency (TF-IDF) of each token is computed over the whole corpus, and each text is vectorized. Third, the words of the input text are mapped through their encodings into the word embedding matrix. The embedded word vectors are averaged, concatenated with the TF-IDF-based text vector, and passed to the output layer, which computes the probability of each class. The model combines low-dimensional word vectors with the expressive power of text vector features and has good generalization and memorization ability. Experimental results show that, once the width feature is introduced, WideText classifies markedly better than the word-vector-based text classification model and slightly better than a feedforward neural network classifier.
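The pipeline described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the author's implementation: the function names, the toy corpus, the smoothed IDF variant, the embedding size, and the randomly initialized (untrained) parameters are all assumptions made for illustration. It only shows how an averaged word embedding and a TF-IDF text vector might be concatenated and fed to a softmax output layer.

```python
import math
import random
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id (the 'dictionary')."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def tfidf_vector(doc, vocab, idf):
    """Width feature: one TF-IDF weight per vocabulary entry."""
    vec = [0.0] * len(vocab)
    counts = Counter(tok for tok in doc if tok in vocab)
    total = sum(counts.values())
    for tok, c in counts.items():
        vec[vocab[tok]] = (c / total) * idf[tok]
    return vec


def mean_embedding(doc, vocab, emb):
    """Word vector feature: average of the embeddings of known tokens."""
    dim = len(emb[0])
    rows = [emb[vocab[tok]] for tok in doc if tok in vocab]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def widetext_forward(doc, vocab, idf, emb, weights, bias):
    """Concatenate both features, apply a linear layer, return class probabilities."""
    features = mean_embedding(doc, vocab, emb) + tfidf_vector(doc, vocab, idf)
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy corpus, already cleaned and tokenized (hypothetical data).
docs = [
    ["cheap", "flight", "deal"],
    ["football", "match", "result"],
    ["cheap", "ticket", "match"],
]
vocab = build_vocab(docs)
n_docs = len(docs)
doc_freq = Counter(tok for doc in docs for tok in set(doc))
# Smoothed IDF (an assumed variant; the paper may define it differently).
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}

# Untrained, randomly initialized parameters (assumed sizes).
rng = random.Random(0)
emb_dim, n_classes = 4, 2
emb = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in vocab]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim + len(vocab))]
           for _ in range(n_classes)]
bias = [0.0] * n_classes

probs = widetext_forward(docs[0], vocab, idf, emb, weights, bias)
print(probs)  # one probability per class, summing to 1
```

In this sketch the input to the output layer has `emb_dim + len(vocab)` entries, so the TF-IDF part grows with the dictionary while the embedding part stays low-dimensional. In practice the embedding matrix and output weights would be trained jointly, which the sketch omits.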

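Cite this article

Li Xuesong (李雪松). Text Classification Model Based on Width and Word Vector Feature. Computer Systems & Applications (计算机系统应用), 2021, 30(3): 177-183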

History
  • Received: 2020-06-15
  • Last revised: 2020-07-14
  • Published online: 2021-03-06