Two-stage Medical Terminology Standardization Based on RoBERTa and T5

doi:10.15888/j.cnki.csa.009370

2024, Vol. 33, No. 1, pp. 280-288. DOI:10.15888/j.cnki.csa.009370

CSTR: 32024.14.csa.009370

Abstract:

Medical terminology standardization is an important means of eliminating entity ambiguity and is widely used in knowledge graph construction. The medical field involves a large number of specialized terms and complex expressions, so traditional matching models often struggle to reach high accuracy. To address this, a two-stage model of semantic recall followed by precise ranking is proposed. In the semantic recall stage, a semantic representation model, CL-BERT, is built on improved supervised contrastive learning and RoBERTa-wwm; CL-BERT generates a semantic representation vector for each entity, and candidates are recalled by cosine similarity between vectors to form a candidate set of standard terms. In the precise ranking stage, T5 combined with prompt tuning is used to build a precise semantic matching model, with FGM adversarial training applied during training; the matching model then ranks each pairing of the original term with a candidate standard term to produce the final standard term. Experiments on the public CCKS 2019 dataset yield an F1 score of 0.9206, indicating that the proposed two-stage model performs well and offers a new approach to medical terminology standardization.
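
The recall-then-rank flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vectors stand in for CL-BERT embeddings, and `word_overlap` is a hypothetical placeholder for the T5 matching model. The function names, the example terms, and the `top_k` value are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_candidates(
    query_vec: Vector, standard_vecs: Dict[str, Vector], top_k: int
) -> List[Tuple[str, float]]:
    """Stage 1: keep the top_k standard terms by cosine similarity."""
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in standard_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(
    original: str, candidates: List[str], match_score: Callable[[str, str], float]
) -> str:
    """Stage 2: score each (original, candidate) pair and return the best one."""
    if not candidates:
        raise ValueError("candidate set is empty")
    return max(candidates, key=lambda cand: match_score(original, cand))


def word_overlap(a: str, b: str) -> float:
    """Hypothetical stand-in for the matching model: Jaccard overlap of words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    # Toy 2-d vectors; a real system would use model-generated embeddings.
    standard_vecs = {
        "hip joint replacement": [1.0, 0.0],
        "knee joint replacement": [0.8, 0.6],
        "appendectomy": [0.0, 1.0],
    }
    query_text = "replacement of knee joint"
    query_vec = [0.9, 0.1]  # assumed embedding of query_text

    recalled = recall_candidates(query_vec, standard_vecs, top_k=2)
    best = rerank(query_text, [term for term, _ in recalled], word_overlap)
    print(recalled)
    print(best)
```

In this toy data the recall stage ranks "hip joint replacement" first, and the ranking stage then selects "knee joint replacement". This shows why a cheap vector search is paired with a more precise pairwise matcher: the first stage only needs to keep the right answer somewhere in the candidate set.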

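Cite this article

Zhou Jing, Cui Cancan, Wang Mengdi, Wang Zemin. Two-stage Medical Terminology Standardization Based on RoBERTa and T5. Computer Systems & Applications (计算机系统应用), 2024, 33(1): 280-288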

History

Received: 2023-05-18
Last revised: 2023-06-26
Published online: 2023-11-24
Publication date: 2023-01-05
