Knowledge-enhanced Named Entity Recognition for Chinese Electronic Medical Records
Author: Li Wanze, Song Bo, Qi Yueshan

    Abstract:

    To address the difficulty of handling nested medical entities in Chinese electronic medical records, this study proposes ERBEGP, a knowledge-enhanced named entity recognition model for Chinese electronic medical records built on the RoBERTa-wwm-ext-large pre-trained model. The whole word masking strategy employed by RoBERTa-wwm-ext-large yields word-level semantic representations, which are better suited to Chinese text. First, by integrating a knowledge graph, the model learns a large number of medical entity nouns, further improving entity recognition accuracy on electronic medical records. Next, a BiLSTM encodes the input sequence, better capturing the contextual semantic information of the records. Finally, the efficient GlobalPointer (EGP) model considers the feature information of both the head and the tail of an entity to predict nested entities, effectively addressing the difficulty of nested entities in named entity recognition for Chinese electronic medical records. The proposed method achieves better recognition results on four datasets from the CBLUE benchmark, demonstrating the effectiveness of the ERBEGP model.
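    The span-based decoding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it omits the rotary position embeddings, multiple entity types, and training of the full efficient GlobalPointer, and all shapes and projection matrices below are hypothetical. The key idea it does show is that every (start, end) pair receives an independent score from head and tail representations, so one entity span nested inside another can still be kept at decoding time.

    ```python
    import numpy as np

    def span_scores(h, Wq, Wk):
        # Project encoder states (e.g. BiLSTM outputs) into head (start)
        # and tail (end) spaces, then score every span as a dot product:
        # s[i, j] = q_i . k_j, a GlobalPointer-style score matrix.
        q = h @ Wq
        k = h @ Wk
        return q @ k.T

    def decode_spans(scores, threshold=0.0):
        # Keep every upper-triangular span above the threshold. Because
        # each (i, j) pair is scored independently, nested entities
        # (one span contained in another) can both survive decoding.
        n = scores.shape[0]
        return [(i, j) for i in range(n) for j in range(i, n)
                if scores[i, j] > threshold]

    # Toy example with random stand-ins for encoder states and projections.
    rng = np.random.default_rng(0)
    h = rng.normal(size=(6, 8))    # 6 tokens, hidden size 8 (hypothetical)
    Wq = rng.normal(size=(8, 4))   # head projection
    Wk = rng.normal(size=(8, 4))   # tail projection
    scores = span_scores(h, Wq, Wk)
    spans = decode_spans(scores)
    ```

    Contrast this with sequence-labeling decoders such as BiLSTM-CRF, which assign one tag per token and therefore cannot emit two overlapping spans; scoring the full (start, end) matrix is what lets the model handle nesting.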

    References
    [1] Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: ACL, 2019. 4171–4186.
    [2] Li ZM, Yun HY, Wang YZ. Medical named entity recognition based on multi-feature fusion with BERT. Journal of Qingdao University (Natural Science Edition), 2021, 34(4): 23–29. [doi: 10.3969/j.issn.1006-1037.2021.11.05]
    [3] Zhao K, Du XP, Gao YJ, et al. Named entity recognition of electronic medical records fusing characters and labels. Computer Systems & Applications, 2022, 31(10): 375–381. [doi: 10.15888/j.cnki.csa.008723]
    [4] Zhang FC, Qin QL, Jiang Y, et al. Named entity recognition of Chinese electronic medical records based on RoBERTa-WWM-BiLSTM-CRF. Data Analysis and Knowledge Discovery, 2022, 6(2–3): 251–262.
    [5] Lee J, Yoon W, Kim S, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020, 36(4): 1234–1240. [doi: 10.1093/bioinformatics/btz682]
    [6] Su JL, Murtadha A, Pan SF, et al. Global pointer: Novel efficient span-based approach for named entity recognition. arXiv:2208.03054, 2022.
    [7] Zhang NY, Jia QH, Yin KP, et al. Conceptualized representation learning for Chinese biomedical text mining. arXiv:2008.10813, 2020.
    [8] Rasmy L, Xiang Y, Xie ZQ, et al. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine, 2021, 4(1): 86. [doi: 10.1038/s41746-021-00455-y]
    [9] Yang FH. Research on BERT models for Chinese clinical natural language processing [Master's thesis]. Beijing: Peking Union Medical College, 2021.
    [10] Cui YM, Che WX, Liu T, et al. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504–3514. [doi: 10.1109/TASLP.2021.3124365]
    [11] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780.
    [12] Che WX, Li ZH, Liu T. LTP: A Chinese language technology platform. Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Beijing: ACL, 2010. 13–16.
    [13] Zhang W, Wong CM, Ye GQ, et al. Billion-scale pre-trained e-commerce product knowledge graph model. Proceedings of the 37th IEEE International Conference on Data Engineering. Chania: IEEE, 2021. 2476–2487.
    [14] Liu WJ, Zhou P, Zhao Z, et al. K-BERT: Enabling language representation with knowledge graph. Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York: AAAI Press, 2020. 2901–2908.
    [15] Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning. Williamstown: Morgan Kaufmann, 2001. 282–289.
    [16] Zhang NY, Chen MS, Bi Z, et al. CBLUE: A Chinese biomedical language understanding evaluation benchmark. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin: ACL, 2022. 7888–7915.
    [17] Lan ZZ, Chen MD, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations. Proceedings of the 8th International Conference on Learning Representations. Addis Ababa: OpenReview.net, 2020.
Cite this article

Li WZ, Song B, Qi YS. Knowledge-enhanced named entity recognition for Chinese electronic medical records. Computer Systems & Applications, 2023, 32(12): 112–119.

History
  • Received: 2023-05-22
  • Revised: 2023-06-28
  • Published online: 2023-09-22