Abstract: Document classification in natural language processing requires a model to extract high-level features from low-level word vectors. Deep neural networks typically extract features from every word in a document, which scales poorly to documents with long content. Moreover, training deep neural networks requires massive amounts of labeled data, and accuracy often suffers under weak supervision. To address these challenges, this research proposes a method for weakly supervised long document classification. On the one hand, a small amount of seed information is used to generate pseudo-documents that augment the training data, mitigating the loss of accuracy caused by scarce labels. On the other hand, recurrent local attention learning extracts summary features from only a few document fragments, which is sufficient to support subsequent category prediction while improving both the speed and accuracy of the model. Experiments show that the pseudo-document generation model does enhance the training data, with particularly significant gains in prediction accuracy under weak supervision. At the same time, the long document classification model based on the local attention mechanism significantly outperforms benchmark models in both prediction accuracy and processing speed, giving it practical application value.
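The abstract only names the mechanism; as a rough illustrative sketch, under assumptions not stated in the paper (NumPy fragment embeddings, a single query vector, and hard top-k fragment selection standing in for the recurrent selection process), attending to only a few document fragments instead of every word might look like:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention_summary(fragments, query, k=3):
    """Score each fragment embedding against a query vector, keep only
    the top-k fragments (local attention), and pool them with their
    renormalized attention weights into one summary feature vector.
    All names here are hypothetical, not the paper's API."""
    scores = fragments @ query              # one relevance score per fragment
    top = np.argsort(scores)[-k:]           # indices of the k best fragments
    weights = softmax(scores[top])          # attend only over the selected few
    return weights @ fragments[top]         # weighted sum -> summary feature

rng = np.random.default_rng(0)
fragments = rng.normal(size=(10, 8))   # 10 fragment embeddings, dimension 8
query = rng.normal(size=8)             # e.g. a learned category query vector
summary = local_attention_summary(fragments, query, k=3)
print(summary.shape)  # (8,)
```

Because the summary is built from only k fragments rather than the full document, the cost of the pooling step is independent of document length, which is the intuition behind the reported speed gains.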