基于宽度和词向量特征的文本分类模型

doi:10.15888/j.cnki.csa.007827

微信公众号

网站二维码

首页 > 过刊浏览>2021年第30卷第3期 >177-183. DOI:10.15888/j.cnki.csa.007827

PDF HTML阅读 XML下载导出引用引用提醒

基于宽度和词向量特征的文本分类模型
DOI:
                        10.15888/j.cnki.csa.007827
                    
作者:
                        
                        
                    
作者单位:
作者简介:
通讯作者:
中图分类号:
基金项目:

Text Classification Model Based on Width and Word Vector Feature

Author:

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献

相似文献

引证文献

资源附件

文章评论

摘要:

针对词向量文本分类模型记忆能力弱, 缺少全局词特征信息等问题, 提出基于宽度和词向量特征的文本分类模型(WideText): 首先对文本进行清洗、分词、词元编码和定义词典等, 计算全局词元的词频-逆文档频度(TF-IDF)指标并将每条文本向量化, 将输入文本中的词通过编码映射到词嵌入矩阵中, 词向量特征经嵌入和平均叠加后, 和基于TF-IDF的文本向量特征进行拼接, 传入到输出层后计算属于每个分类的概率. 该模型在低维词向量的基础上结合了文本向量特征的表达能力, 具有良好的泛化和记忆能力. 实验结果表明, 在引入宽度特征后, WideText分类性能不仅较词向量文本分类模型有明显提升, 且略优于前馈神经网络分类器.

Abstract:

To resolve the issues of weak memory ability and no global word feature information in the word-vector-based text classification model, we propose a text classification model (WideText) based on the width and word vector features. Firstly, text cleaning, word segmentation, unit encoding and dictionary definitions are carried out. Secondly, the Term Frequency-Inverse Document Frequency (TF-IDF) index of the global word units is calculated and each text is vectorized. Furthermore, the words in the input text are mapped to the word embedding matrix through encoding. After the word vector features are embedded and averagely superimposed, they are spliced with the text vector features based on TF-IDF and transmitted to the output layer. Finally, the probability of the features belonging to each category is calculated. The proposed model combines the expressive ability of text vector features on the basis of low-dimensional word vectors and has excellent generalization and memory abilities. The experimental results show that after the introduction of the width feature, the WideText classification performance is significantly improved in comparison with that in the word-vector-based text classification model and also slightly better than that in the feedforward neural network classifiers.

参考文献

相似文献

引证文献

引用本文

李雪松.基于宽度和词向量特征的文本分类模型.计算机系统应用,2021,30(3):177-183

复制

文章指标

点击次数:
下载次数:
HTML阅读次数:
引用次数:

历史

收稿日期:2020-06-15
最后修改日期:2020-07-14
录用日期:
在线发布日期: 2021-03-06
出版日期:

微信公众号

网站二维码

引用本文

分享

文章指标

历史

文章二维码