Chinese Long Text Classification Based on FastText and Key Sentence Extraction
    Abstract:

    FastText is a precise and efficient text classification model, but its precision is low when it is applied directly to Chinese long text classification. To address this problem, this study proposes a FastText-based method for Chinese long text classification that combines TextRank key clause extraction with Term Frequency-Inverse Document Frequency (TF-IDF). First, TextRank extracts the key clauses of the text as input features. Second, TF-IDF extracts keywords of the text as supplementary features. Finally, the extracted features are fed into the FastText model, which preserves the key features of the target text while reducing the size of the training corpus. Experimental results show that the proposed method reaches an accuracy of 86.1% on the datasets, about 4% higher than the classic FastText model.
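
    The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the text is already segmented into tokens (real Chinese text would first need a word segmenter), it uses the word-overlap sentence similarity from the original TextRank paper, a smoothed IDF, and simply appends the TF-IDF keywords to the selected sentences. The function names and the parameters top_k, n_sentences and n_keywords are hypothetical. Only the `__label__` prefix reflects FastText's default supervised input format.

```python
import math
from collections import Counter


def textrank_sentences(sentences, top_k=2, damping=0.85, iterations=50):
    """Return indices of the top_k tokenized sentences ranked by TextRank."""
    n = len(sentences)
    sets = [set(s) for s in sentences]
    # Assumed similarity: shared words normalised by log sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.log(len(sentences[i])) + math.log(len(sentences[j]))
            if i != j and denom > 0:
                sim[i][j] = len(sets[i] & sets[j]) / denom
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):  # fixed number of power-iteration steps
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(best)  # restore original sentence order


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Return the top_k tokens of one document by TF-IDF over a small corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)

    def idf(term):
        df = sum(1 for doc in corpus if term in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)

    scored = {t: (c / len(doc_tokens)) * idf(t) for t, c in tf.items()}
    return sorted(scored, key=lambda t: (-scored[t], t))[:top_k]


def build_fasttext_line(label, sentences, corpus, n_sentences=2, n_keywords=3):
    """Build one FastText training line: key sentences plus TF-IDF keywords."""
    keep = textrank_sentences(sentences, top_k=n_sentences)
    tokens = [tok for i in keep for tok in sentences[i]]
    doc_tokens = [tok for s in sentences for tok in s]
    tokens += tfidf_keywords(doc_tokens, corpus, top_k=n_keywords)
    return "__label__" + label + " " + " ".join(tokens)


# Toy, pre-segmented document (illustrative data only).
sentences = [
    ["球队", "赢得", "比赛", "冠军"],
    ["球迷", "庆祝", "球队", "冠军"],
    ["天气", "晴朗"],
    ["球队", "比赛", "精彩"],
]
corpus = [
    [tok for s in sentences for tok in s],
    ["股票", "市场", "上涨"],
    ["比赛", "门票", "售罄"],
]
print(build_fasttext_line("sports", sentences, corpus))
```

    The resulting lines would then be written to a file and passed to a FastText supervised trainer. Shortening each document to a few key sentences is what reduces the training corpus, while the appended keywords are meant to recover features that the dropped sentences carried.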

Citation

Wang Jiacheng, Xue Tao. Chinese long text classification based on FastText and key sentence extraction. Computer Systems & Applications, 2021, 30(8):213-218
History
  • Received: November 12, 2020
  • Revised: December 14, 2020
  • Online: August 03, 2021