Text Summarization Quality Evaluation Based on Large Language Model

Authors: 谭琛瀚, 贾克斌, 王浩宇

Funding: Beijing Natural Science Foundation (4212001)

Abstract:

Automatic text summarization is an important branch of natural language processing (NLP), and one of its main difficulties lies in evaluating the quality of generated summaries quickly, objectively, and accurately. To address the low evaluation accuracy, dependence on reference texts, and heavy computational cost of existing summary quality evaluation methods, this study proposes an evaluation method based on large language models (LLMs). It designs a prompt construction method based on the chain-of-thought (CoT) principle to improve LLM performance on the summary quality evaluation task, and it generates a CoT dataset with which a small LLM is fine-tuned, significantly reducing computational requirements. The proposed method first determines the evaluation dimensions according to the characteristics of text summaries and constructs prompts based on the CoT principle. These prompts guide a large LLM to produce CoT reasoning and evaluation results for summary samples, from which a CoT dataset is generated. The dataset is then used to fine-tune a small LLM, and the fine-tuned small LLM finally performs the summary quality evaluation. Comparative experiments and analyses on the SummEval dataset show that this method significantly improves the evaluation accuracy of small LLMs on the summary quality evaluation task, yielding a reference-free evaluation method with high accuracy, low computational requirements, and easy deployment.
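The pipeline the abstract describes (CoT prompt construction → teacher-LLM reasoning and score → training records for the small model) can be sketched as follows. The dimension names match SummEval's (coherence, consistency, fluency, relevance), but the step lists, prompt wording, and the `Score:` output convention are illustrative assumptions, not the authors' actual prompts.

```python
import re

# Per-dimension reasoning steps; only two dimensions are sketched here.
EVAL_STEPS = {
    "consistency": [
        "Read the source document and list its key facts.",
        "Check each statement in the summary against those facts.",
        "Count unsupported or contradicted statements.",
    ],
    "coherence": [
        "Examine how the summary's sentences connect to one another.",
        "Judge whether their ordering forms a clear narrative.",
    ],
}

def build_cot_prompt(document: str, summary: str, dimension: str) -> str:
    """Compose a reference-free evaluation prompt that asks the model to
    reason step by step before emitting a 1-5 score."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(EVAL_STEPS[dimension], 1))
    return (
        f"Evaluate the summary below on {dimension}.\n"
        f"Follow these steps and show your reasoning:\n{steps}\n"
        f"Finish with a line 'Score: <1-5>'.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}\n"
    )

def parse_score(model_output: str) -> int:
    """Extract the final 1-5 score from a model's chain-of-thought output."""
    m = re.search(r"Score:\s*([1-5])", model_output)
    if m is None:
        raise ValueError("no score found in model output")
    return int(m.group(1))
```

In this sketch, the teacher LLM's full response to `build_cot_prompt(...)` (reasoning plus the line parsed by `parse_score`) would be stored as one record of the CoT fine-tuning dataset for the small model.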

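Evaluation accuracy on SummEval is conventionally reported as the correlation between a metric's scores and expert human ratings, which is presumably what "evaluation accuracy" refers to here. A dependency-free Spearman rank correlation sketch (illustrative; the paper's exact protocol is not specified on this page):

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, human):
    """Spearman correlation between model scores and human ratings."""
    rp, rh = _ranks(pred), _ranks(human)
    n = len(pred)
    mp, mh = sum(rp) / n, sum(rh) / n
    cov = sum((a - mp) * (b - mh) for a, b in zip(rp, rh))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sp * sh)
```

A correlation near 1 means the fine-tuned model ranks summaries the same way the human annotators do, which is the sense in which the abstract claims improved accuracy.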
Cite this article: 谭琛瀚, 贾克斌, 王浩宇. Text summarization quality evaluation based on large language model. 计算机系统应用, 2025, 34(2): 28–36.
History:
  • Received: 2024-07-15
  • Last revised: 2024-08-13
  • Published online: 2024-12-19
Copyright: Institute of Software, Chinese Academy of Sciences