Computer Systems & Applications, 2024, 33(1): 280-288
Two-stage Medical Terminology Standardization Based on RoBERTa and T5
(1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China; 2. Beijing Smart Insight Technology Co. Ltd., Beijing 100080, China)
Received:May 18, 2023    Revised:June 26, 2023
Abstract: Medical terminology standardization is an important means of eliminating entity ambiguity and is widely used in knowledge graph construction. Because the medical domain involves a large number of specialized terms and complex expressions, traditional matching models often struggle to reach high accuracy. To address this, a two-stage model of semantic recall followed by precise ranking is proposed. In the semantic recall stage, a semantic representation model, CL-BERT, is built on improved supervised contrastive learning and RoBERTa-wwm; CL-BERT generates a semantic representation vector for each entity, and candidates are recalled by the cosine similarity between vectors to form a candidate set of standard terms. In the precise ranking stage, a precise semantic matching model is built with T5 and prompt tuning, and FGM adversarial training is applied during training; the matching model then scores the original term against each term in the candidate set and ranks them to obtain the final standard term. Experiments on the public CCKS 2019 dataset yield an F1 score of 0.9206, indicating that the proposed two-stage model performs well and offers a new approach to medical terminology standardization.
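The recall-then-rank flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding vectors stand in for CL-BERT output, `match_score` is a hypothetical placeholder for the T5 matching model, and all function names and the `top_k` parameter are assumptions made for illustration only.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_candidates(
    query_vec: Vector, standard_vecs: Dict[str, Vector], top_k: int
) -> List[str]:
    """Stage 1 (recall): keep the top_k standard terms by cosine similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), term)
        for term, vec in standard_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:top_k]]


def standardize(
    mention: str,
    query_vec: Vector,
    standard_vecs: Dict[str, Vector],
    match_score: Callable[[str, str], float],
    top_k: int = 10,
) -> str:
    """Stage 2 (ranking): re-score recalled candidates and return the best one.

    match_score is a hypothetical stand-in for a fine-grained matching model.
    """
    candidates = recall_candidates(query_vec, standard_vecs, top_k)
    if not candidates:
        raise ValueError("no standard terms available for recall")
    return max(candidates, key=lambda cand: match_score(mention, cand))
```

The split keeps the expensive pairwise scorer off the full vocabulary: only the `top_k` recalled candidates are passed to it, so recall quality bounds what the ranking stage can recover.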
Citation:
ZHOU Jing, CUI Can-Can, WANG Meng-Di, WANG Ze-Min. Two-stage Medical Terminology Standardization Based on RoBERTa and T5. Computer Systems & Applications, 2024, 33(1): 280-288