Chinese Notional Word Discrimination Based on RoBERTa-ND

Fund Project:

National Natural Science Foundation of China (62072469); 2021 Open Project of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences (20210114)

    Abstract:

    Chinese notional words are compositional and metaphorical, and datasets for Chinese notional word discrimination are lacking. As a result, traditional methods remain limited in how well they understand and discriminate Chinese notional words in machine reading comprehension tasks. To address this, a large-scale (600k) Chinese notional word discrimination cloze dataset (CND) is constructed. In each example, one notional word in a sentence is replaced with a blank placeholder, and the correct answer must be selected from two candidate notional words. A baseline model, the RoBERTa-based notional word discrimination model (RoBERTa-ND), is designed to choose between the candidates. The model first uses a pre-trained language model to extract semantic information from the context. It then fuses the semantics of the candidate notional words and computes a score for each candidate through a classification task. Finally, it strengthens its perception of position and direction information, which further improves its ability to discriminate Chinese notional words. Experiments show that the model achieves an accuracy of 90.21% on CND, outperforming mainstream cloze models such as DUMA (87.59%) and GNN-QA (84.23%). This work fills a gap in research on Chinese metaphorical semantic understanding and has practical value for applications such as improving the cognitive ability of Chinese dialogue robots. The CND dataset and the RoBERTa-ND code are open source: https://github.com/2572926348/CND-Large-scale-Chinese-National-word-discrimination-dataset
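
The cloze format described in the abstract can be illustrated with a minimal Python sketch. It is not the released CND format or the RoBERTa-ND model: the placeholder string, the field names (`context`, `candidates`, `label`), and the cue-table scorer `toy_score` are all assumptions made for illustration. The scorer stands in for the model's candidate-scoring step only so that the selection and accuracy logic can be run end to end.

```python
import random
from typing import Callable, Dict, List

PLACEHOLDER = "[BLANK]"  # assumed marker; the placeholder used in CND may differ

ScoreFn = Callable[[str, str], float]


def make_cloze_item(sentence: str, answer: str, distractor: str,
                    rng: random.Random) -> Dict[str, object]:
    """Blank out the first occurrence of `answer` and pair it with one distractor."""
    if answer not in sentence:
        raise ValueError("the answer word must occur in the sentence")
    context = sentence.replace(answer, PLACEHOLDER, 1)
    candidates = [answer, distractor]
    rng.shuffle(candidates)  # so the answer is not always in slot 0
    return {"context": context, "candidates": candidates,
            "label": candidates.index(answer)}


def choose(item: Dict[str, object], score_fn: ScoreFn) -> int:
    """Return the index of the higher-scoring candidate."""
    scores = [score_fn(item["context"], c) for c in item["candidates"]]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items: List[Dict[str, object]], score_fn: ScoreFn) -> float:
    """Fraction of items where the chosen candidate matches the label."""
    correct = sum(choose(it, score_fn) == it["label"] for it in items)
    return correct / len(items)


# Hypothetical cue table standing in for a learned scorer:
# (cue word in context, candidate) -> weight.
TOY_CUES = {("态度", "坚定"): 1.0,   # "attitude" goes with "firm (resolute)"
            ("石头", "坚硬"): 1.0}   # "stone" goes with "hard (solid)"


def toy_score(context: str, candidate: str) -> float:
    """Sum the weights of cue words in the context that match the candidate."""
    return sum(w for (cue, cand), w in TOY_CUES.items()
               if cand == candidate and cue in context)


rng = random.Random(0)
demo_items = [
    make_cloze_item("他的态度很坚定", "坚定", "坚硬", rng),  # "His attitude is firm"
    make_cloze_item("这块石头很坚硬", "坚硬", "坚定", rng),  # "This stone is hard"
]
print(accuracy(demo_items, toy_score))
```

In a real system, `toy_score` would be replaced by a model that encodes the context together with each candidate, which is where a pre-trained encoder such as RoBERTa would come in. The rest of the pipeline (item construction, two-way choice, accuracy) would stay the same under these assumptions.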

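Cite this article

Sun Chenyu, Wang Zhenqi, Zhang Baoyu, Zhang Weishan, Hou Zhaoxiang, Chen Tao. Chinese Notional Word Discrimination Based on RoBERTa-ND. Computer Systems & Applications, 2023, 32(5): 157-163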

History
  • Received: 2022-11-03
  • Last revised: 2022-12-10
  • Published online: 2023-03-17