Hierarchical Text Classification for Label Co-occurrence and Long-tail Distribution
Authors: 智媛, 雷海卫, 张斌龙

Abstract:

Existing hierarchical text classification models exhibit two shortcomings: they underutilize the label information shared across hierarchical instances, and they lack a mechanism for handling imbalanced label distributions. To address these problems, this study proposes a hierarchical text classification method for label co-occurrence and long-tail distribution (LC-LTD), which studies the global semantics of texts based on shared labels together with a balanced loss function for long-tailed distributions. First, a contrastive learning objective based on shared labels is devised to narrow the semantic distance in feature space between representations of texts that share more labels, guiding the model to generate discriminative semantic representations. Second, a distribution-balanced loss function replaces the binary cross-entropy loss to alleviate the long-tail distribution problem inherent in hierarchical classification and improve the model's generalization ability. LC-LTD is compared with several mainstream models on the public WOS and BGC datasets, and the results show that the proposed method achieves better classification performance and is better suited to hierarchical text classification.
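The shared-label contrastive objective described above can be sketched as follows. This is a hypothetical NumPy illustration of the general idea — positive pairs weighted in proportion to the number of labels the two texts share — not the authors' implementation; the function name, the normalization scheme, and the temperature value are assumptions.

```python
import numpy as np

def shared_label_contrastive_loss(feats, labels, tau=0.1):
    """Sketch: a supervised contrastive loss whose positive-pair weights
    grow with the number of labels two texts share (hypothetical form).

    feats:  (n, d) text representations
    labels: (n, c) multi-hot label matrix
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize
    logits = feats @ feats.T / tau                                # scaled cosine similarity
    n = feats.shape[0]
    mask = ~np.eye(n, dtype=bool)                                 # exclude self-pairs

    # Weight each pair by its shared-label count, normalized per anchor,
    # so anchors pull same-label texts closer in proportion to overlap.
    shared = (labels @ labels.T) * mask
    weights = shared / np.maximum(shared.sum(axis=1, keepdims=True), 1e-12)

    # Log-softmax over non-self pairs (log-sum-exp for numerical stability).
    masked = np.where(mask, logits, -np.inf)
    m = masked.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True))
    log_prob = np.where(mask, logits - lse, 0.0)

    return float(-(weights * log_prob).sum(axis=1).mean())
```

Under this sketch, a batch whose same-label texts already have similar representations incurs a lower loss than one where they point apart, which is the pulling-together behavior the abstract describes.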
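The distribution-balanced loss that replaces binary cross-entropy can likewise be sketched. The version below follows the general recipe of Wu et al. (ECCV 2020) — re-balancing weights derived from per-class vs. per-instance sampling probabilities, plus a negative-tolerant scaling of the negative term — but it is a simplified NumPy illustration, not the paper's exact formulation: the hyperparameter values and the omission of the per-class bias term are assumptions.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distribution_balanced_loss(logits, targets, class_freq,
                               alpha=0.1, beta=10.0, mu=0.2, lam=5.0):
    """Simplified sketch of a distribution-balanced loss for multi-label
    long-tailed classification (after Wu et al., ECCV 2020).

    logits:     (b, c) raw scores
    targets:    (b, c) multi-hot labels
    class_freq: (c,) training-set count of each label
    """
    inv = 1.0 / np.asarray(class_freq, dtype=float)   # rarer label -> larger value
    # Re-balancing: ratio of per-class to per-instance sampling probability,
    # countering the oversampling caused by label co-occurrence.
    p_class = inv[None, :]
    p_inst = (targets * inv).sum(axis=1, keepdims=True) \
             / np.maximum(targets.sum(axis=1, keepdims=True), 1.0)
    r = p_class / np.maximum(p_inst, 1e-12)
    r_hat = alpha + _sigmoid(beta * (r - mu))         # smoothed re-balancing weight

    pos = targets * np.log(_sigmoid(logits) + 1e-12)
    # Negative-tolerant regularization: sharpen then temper the negative
    # term by 1/lam so massed easy negatives do not dominate tail classes.
    neg = (1.0 - targets) / lam * np.log(1.0 - _sigmoid(lam * logits) + 1e-12)
    return float(-(r_hat * (pos + neg)).mean())
```

As with ordinary BCE, a correct confident prediction yields a much smaller loss than a confidently wrong one; the re-balancing and tempering only reshape how tail classes and negatives contribute.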

Cite this article:

Zhi Y, Lei HW, Zhang BL. Hierarchical text classification for label co-occurrence and long-tail distribution. Computer Systems & Applications, 2025, 34(2): 174–182.

History
  • Received: 2024-07-29
  • Revised: 2024-08-20
  • Published online: 2024-11-28