Received: November 3, 2022; Revised: December 10, 2022
Abstract: Chinese notional words are compositional and often metaphorical, and datasets for Chinese notional word discrimination have been lacking. As a result, traditional methods remain limited in how well they understand and discriminate Chinese notional words in machine reading comprehension tasks. To address this, a large-scale (600k) Chinese notional word discrimination cloze dataset (CND) is constructed. In each instance, one notional word in a sentence is replaced with a blank placeholder, and the correct answer must be selected from two candidate notional words. A baseline model, the RoBERTa-based notional word discrimination model (RoBERTa-ND), is designed to choose between the candidates. The model first uses a pre-trained language model to extract semantic information from the context. It then fuses the semantics of the candidate notional words and computes a score for each candidate through a classification task. Finally, it strengthens its perception of position and direction information, which further improves its ability to discriminate Chinese notional words. Experiments show that the model reaches an accuracy of 90.21% on CND, outperforming mainstream cloze models such as DUMA (87.59%) and GNN-QA (84.23%). This work fills a gap in research on Chinese metaphorical semantic understanding and may offer practical value in directions such as improving the cognitive ability of Chinese dialogue bots. The CND dataset and the RoBERTa-ND code are open source: https://github.com/2572926348/CND-Large-scale-Chinese-National-word-discrimination-dataset
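The cloze format described in the abstract (one blank, two candidate words, one correct answer) can be illustrated with a minimal Python sketch. This is not the authors' code: the `[BLANK]` marker, the `ClozeItem` field names, and the `Scorer` interface are assumptions made for illustration, and the example sentence in the usage note is invented rather than drawn from CND. The scorer is a stand-in for whatever model assigns a score to each candidate, such as RoBERTa-ND.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Hypothetical placeholder token; the marker used in the released CND files may differ.
BLANK = "[BLANK]"


@dataclass(frozen=True)
class ClozeItem:
    """One two-way cloze instance (field names are assumptions, not the CND schema)."""
    sentence: str                # sentence containing exactly one BLANK
    candidates: Tuple[str, str]  # the two candidate notional words
    answer: int                  # index (0 or 1) of the correct candidate


# A scorer maps (sentence with blank, candidate word) to a real-valued score.
# In the paper's setting a trained model would play this role.
Scorer = Callable[[str, str], float]


def fill(item: ClozeItem, idx: int) -> str:
    """Return the sentence with the blank replaced by candidate `idx`."""
    return item.sentence.replace(BLANK, item.candidates[idx], 1)


def choose(item: ClozeItem, scorer: Scorer) -> int:
    """Pick the index of the higher-scoring candidate (ties go to index 0)."""
    first, second = (scorer(item.sentence, c) for c in item.candidates)
    return 0 if first >= second else 1


def accuracy(items: Iterable[ClozeItem], scorer: Scorer) -> float:
    """Fraction of items where the chosen candidate matches the labeled answer."""
    items = list(items)
    if not items:
        return 0.0
    correct = sum(choose(it, scorer) == it.answer for it in items)
    return correct / len(items)
```

As a usage example, an invented item such as `ClozeItem("他的话[BLANK]了大家的热情", ("点燃", "燃烧"), 0)` can be passed to `choose` together with any scoring function. Accuracy over a list of such items is the kind of metric the abstract reports for CND.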
Keywords: metaphorical semantic understanding; Chinese notional word discrimination; machine reading comprehension
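Funding: National Natural Science Foundation of China (62072469); 2021 Open Project of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences (20210114)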
Cite this article:
SUN Chen-Yu, WANG Zhen-Qi, ZHANG Bao-Yu, ZHANG Wei-Shan, HOU Zhao-Xiang, CHEN Tao. Chinese Notional Word Discrimination Based on RoBERTa-ND. Computer Systems Applications, 2023, 32(5): 157-163