Weakly-Supervised Long Document Classification Based on Local Attention Mechanism
CSTR:
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    The task of document classification in natural language processing requires the model to extract high-level features from low-level word vectors. Generally, the feature extraction of deep neural networks uses all the words in the document, which is not well suited for documents with long content. In addition, training deep neural networks requires massive labeled data, which often fails to achieve satisfied results under weak supervision. To meet these challenges, this research proposes a method to deal with weakly-supervised long document classification. On the one hand, a small amount of seed information is used to generate pseudo-documents to enhance training data to deal with the situation where accuracy is difficult to improve due to the lack of labeled data. On the other hand, using recurrent local attention learning to extract summary features based on only a few document fragments is sufficient to support subsequent category prediction and improve the model’s speed and accuracy. Experiments show that the pseudo-document generation model can indeed enhance the training data, and the improvement in prediction accuracy is particularly significant under weak supervision. At the same time, the long document classification model based on the local attention mechanism performs significantly better than benchmark models in prediction accuracy and processing speed, with practical application value.

    Reference
    Related
    Cited by
Get Citation

马雯琦,何跃.基于局部注意力机制的弱监督长文档分类.计算机系统应用,2021,30(11):54-62

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:January 30,2021
  • Revised:March 05,2021
  • Adopted:
  • Online: October 22,2021
  • Published:
Article QR Code
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-3
Address:4# South Fourth Street, Zhongguancun,Haidian, Beijing,Postal Code:100190
Phone:010-62661041 Fax: Email:csa (a) iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063