Chinese Notional Word Discrimination Based on RoBERTa-ND
    Abstract:

    Chinese notional words are combinatorial and metaphorical in nature, and datasets for Chinese notional word discrimination are scarce. As a result, traditional methods remain limited in their ability to understand and discriminate Chinese notional words in machine reading comprehension tasks. To address this, a large-scale (600k-sample) Chinese notional word discrimination cloze dataset (CND) is constructed. In the dataset, a notional word in each sentence is replaced with a blank placeholder, and the correct answer must be selected from two candidate notional words. A baseline model, the RoBERTa-based notional word discrimination model (RoBERTa-ND), is designed to select among the candidates. The model first extracts semantic information from the context with a pre-trained language model; it then fuses the semantics of the candidate notional words and computes a score for each candidate via a classification head; finally, its discrimination ability is further improved by strengthening its perception of position and orientation information. Experiments show that the model achieves 90.21% accuracy on CND, outperforming mainstream cloze-test models such as DUMA (87.59%) and GNN-QA (84.23%). This work fills a gap in research on Chinese metaphorical semantic understanding and has practical value for improving the cognitive ability of Chinese question-answering bots. The code for CND and RoBERTa-ND is open source: https://github.com/2572926348/CND-Large-scale-Chinese-National-word-discrimination-dataset.
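    The pipeline the abstract describes (blank out a notional word, encode the context, fuse candidate semantics, score the candidates with a classification head) can be sketched as follows. This is a toy illustration, not the authors' released code: the data-record format, the helper names (`fuse`, `score_candidates`), the elementwise-product fusion, and the hand-set vectors standing in for pre-trained-model outputs are all assumptions for illustration.

    ```python
    import math

    # Toy CND-style record: a sentence with the notional word blanked out
    # and two candidate notional words (format assumed for illustration).
    example = {
        "context": "他 [BLANK] 了很久才做出决定。",  # "He ____ for a long time before deciding."
        "candidates": ["犹豫", "踌躇"],              # two near-synonymous notional words
        "answer": 0,
    }

    def softmax(scores):
        """Numerically stable softmax over a list of raw scores."""
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def fuse(context_vec, cand_vec):
        """Fuse context and candidate semantics by elementwise product
        (a stand-in for the paper's fusion step)."""
        return [c * v for c, v in zip(context_vec, cand_vec)]

    def score_candidates(context_vec, cand_vecs, weights):
        """Linear classification head over the fused representations,
        normalized to a probability per candidate."""
        scores = []
        for cand in cand_vecs:
            fused = fuse(context_vec, cand)
            scores.append(sum(w * f for w, f in zip(weights, fused)))
        return softmax(scores)

    # Hand-set toy vectors standing in for pre-trained-model outputs.
    context_vec = [0.8, -0.2, 0.5]
    cand_vecs = [[0.9, -0.1, 0.4],   # embedding of candidate 0 ("犹豫")
                 [0.1,  0.7, -0.3]]  # embedding of candidate 1 ("踌躇")
    weights = [1.0, 1.0, 1.0]        # toy classification-head weights

    probs = score_candidates(context_vec, cand_vecs, weights)
    prediction = max(range(len(probs)), key=probs.__getitem__)
    print(probs, prediction)
    ```

    In the actual model, the context vector would come from RoBERTa's encoding of the masked sentence and the candidate vectors from its representations of the candidate words; the toy linear head stands in for the trained classification layer.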

    References
    [1] Cui YM, Liu T, Chen ZP, et al. Dataset for the first evaluation on Chinese machine reading comprehension. Proceedings of the 11th International Conference on Language Resources and Evaluation. Miyazaki: European Language Resources Association, 2018.
    [2] Hill F, Bordes A, Chopra S, et al. The Goldilocks principle: Reading children’s books with explicit memory representations. Proceedings of the 4th International Conference on Learning Representations. San Juan: ICLR, 2016.
    [3] Zheng CJ, Huang ML, Sun AX. ChID: A large-scale Chinese idiom dataset for cloze test. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: Association for Computational Linguistics, 2019. 778–787.
    [4] Cui YM, Liu T, Chen ZP, et al. Consensus attention-based neural networks for Chinese reading comprehension. Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. Osaka: The COLING 2016 Organizing Committee, 2016. 1777–1786.
    [5] Onishi T, Wang H, Bansal M, et al. Who did what: A large-scale person-centered cloze dataset. Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin: Association for Computational Linguistics, 2016. 2230–2235.
    [6] Zhang ZS. Discrimination of “抱怨” and “埋怨” and their dictionary definitions. Lexicographical Studies, 2006, (3): 46–53. [doi: 10.3969/j.issn.1000-6125.2006.03.008]
    [7] Zhang ZS. Discrimination of predicate synonyms from the perspective of semantic roles [Ph.D. Thesis]. Xiamen: Xiamen University, 2006.
    [8] Graves A. Long short-term memory. Supervised Sequence Labelling with Recurrent Neural Networks. Berlin: Springer, 2012. 37–45.
    [9] Xu B. NLP Chinese corpus: Large scale Chinese corpus for NLP. https://github.com/safin1120/nlp_chinese_corpus. [2022-10-24].
    [10] Sun M, Li J, Guo Z, et al. THUCTC: An efficient Chinese text classifier. GitHub Repository. http://thuctc.thunlp.org/. (2016-01-25)[2022-10-24].
    [11] Zhang ZS. The difference between “陆续” and “连续” and their dictionary definitions. Lexicographical Studies, 2006, (1): 68–77.
    [12] Ding MR, Liu HY, Xu MY, et al. Multi-task hierarchical fine-tuning model for machine reading comprehension. Computer Systems & Applications, 2022, 31(3): 212–219. [doi: 10.15888/j.cnki.csa.008417]
    [13] Sheng YX, Lan M. Machine reading comprehension model for multiple-choice questions using external knowledge and multi-step reasoning. Computer Systems & Applications, 2020, 29(4): 1–9. [doi: 10.15888/j.cnki.csa.007327]
    [14] Cui YM, Che WX, Liu T, et al. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504–3514. [doi: 10.1109/TASLP.2021.3124365]
    [15] Liu YH, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.
    [16] Lan ZZ, Chen MS, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations. Proceedings of the 8th International Conference on Learning Representations. Addis Ababa: OpenReview.net, 2020.
    [17] Kingma DP, Ba J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations. San Diego: ICLR, 2015.
    [18] Kadlec R, Schmid M, Bajgar O, et al. Text understanding with the attention sum reader network. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin: Association for Computational Linguistics, 2016. 908–918.
    [19] Zhu PF, Zhang ZS, Zhao H, et al. DUMA: Reading comprehension with transposition thinking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 269–279.
    [20] Wang K, Zhang YY, Yang DY, et al. GNN is a counter? Revisiting GNN for question answering. Proceedings of the 10th International Conference on Learning Representations. OpenReview.net, 2022.
Citation:

Sun CY, Wang ZQ, Zhang BY, Zhang WS, Hou ZX, Chen T. Chinese notional word discrimination based on RoBERTa-ND. Computer Systems & Applications, 2023, 32(5): 157–163.

History
  • Received: November 3, 2022
  • Revised: December 10, 2022
  • Online: March 17, 2023