Received: January 30, 2021; Revised: March 05, 2021
Abstract: Document classification in natural language processing requires a model to extract high-level features from low-level word vectors. Deep neural networks typically extract these features from every word in a document, an approach that is poorly suited to long documents. In addition, training deep neural networks requires large amounts of labeled data, so results under weak supervision are often unsatisfactory. To address these challenges, this study proposes a method for weakly-supervised long document classification. On the one hand, a small amount of seed information is used to generate pseudo-documents that augment the training data, easing the difficulty of improving accuracy when labeled data are scarce. On the other hand, recurrent local attention learning extracts summary features from only a few document fragments, which is sufficient to support the subsequent category prediction and improves both the speed and the accuracy of the model. Experiments show that the proposed pseudo-document generation model does augment the training data, and that the gain in prediction accuracy is particularly significant under weak supervision. Meanwhile, the long document classification model based on the local attention mechanism achieves significantly higher prediction accuracy than the baseline models and also performs well in processing speed, giving it practical application value.
Keywords: document classification; deep learning; weakly-supervised learning; pseudo-document; local attention mechanism
Foundation item: National Natural Science Foundation of China (71571174)
Author Name | Affiliation | E-mail |
MA Wen-Qi | School of Management, University of Science and Technology of China, Hefei 230026, China | marchyvt@mail.ustc.edu.cn |
HE Yue | Business School, Sichuan University, Chengdu 610065, China | |
Citation:
MA Wen-Qi, HE Yue. Weakly-Supervised Long Document Classification Based on Local Attention Mechanism. Computer Systems Applications, 2021, 30(11): 54-62