Received:June 15, 2020 Revised:July 14, 2020
Abstract: Word-vector-based text classification models have weak memorization ability and lack global word feature information. To address these issues, we propose WideText, a text classification model based on width features and word vector features. The text is first cleaned, segmented into words, and encoded as tokens, and a dictionary is defined. The Term Frequency-Inverse Document Frequency (TF-IDF) of every token in the global vocabulary is then computed, and each text is vectorized. The words of the input text are mapped through their codes into a word embedding matrix; the embedded word vectors are averaged and concatenated with the TF-IDF text vector, and the combined features are passed to the output layer, which computes the probability of each class. The model thus combines low-dimensional word vectors with the expressive power of text vector features, giving it good generalization and memorization ability. Experimental results show that, once the width features are introduced, WideText clearly outperforms the word-vector-based text classification model and is slightly better than a feedforward neural network classifier.
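The forward pass described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the author's implementation: the function names, the smoothed IDF formula, the toy corpus, the embedding size, and the randomly initialized (untrained) weights are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token (the 'dictionary')."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def compute_idf(docs, vocab):
    """Smoothed IDF per token. The exact formula is an assumption."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}

def tfidf_vector(doc, vocab, idf):
    """Width feature: one TF-IDF weight per vocabulary entry."""
    vec = [0.0] * len(vocab)
    counts = Counter(tok for tok in doc if tok in vocab)
    total = sum(counts.values())
    for tok, c in counts.items():
        vec[vocab[tok]] = (c / total) * idf[tok]
    return vec

def mean_embedding(doc, vocab, emb):
    """Word vector feature: average of the embedding rows of the tokens."""
    rows = [emb[vocab[tok]] for tok in doc if tok in vocab]
    if not rows:
        return [0.0] * len(emb[0])
    return [sum(col) / len(rows) for col in zip(*rows)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def widetext_forward(doc, vocab, idf, emb, W, b):
    """Concatenate both feature groups, then apply a linear output layer."""
    features = mean_embedding(doc, vocab, emb) + tfidf_vector(doc, vocab, idf)
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Toy corpus, already cleaned and segmented (hypothetical data).
docs = [["loan", "rate", "bank"], ["card", "fee", "bank"], ["loan", "fee"]]
vocab = build_vocab(docs)
idf = compute_idf(docs, vocab)

rng = random.Random(0)
emb_dim, n_classes = 4, 2  # illustrative sizes, not taken from the paper
emb = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in vocab]
W = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim + len(vocab))]
     for _ in range(n_classes)]
b = [0.0] * n_classes

probs = widetext_forward(docs[0], vocab, idf, emb, W, b)
print(probs)  # one probability per class, summing to 1
```

In this sketch the TF-IDF part contributes one input per vocabulary entry, which is what lets the output layer memorize the effect of individual words, while the averaged embedding contributes a small dense vector that generalizes across similar words. In practice the embedding matrix and output weights would be trained jointly instead of left at their random initial values.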
Keywords: Word2Vec; FastText; WideText; text classification
Author Name | Affiliation | Email |
LI Xue-Song | Digital Personal Banking Department, Bank of China, Beijing 100818, China | wljrbdsjlxs_hq@mail.notes.bank-of-china.com |
Citation:
LI Xue-Song. Text Classification Model Based on Width and Word Vector Feature. Computer Systems Applications, 2021, 30(3): 177-183