Computer Systems & Applications, 2019, Vol. 28, Issue (2): 8-14

Named Entity Recognition of Online Medical Question Answering Text
YANG Wen-Ming, CHU Wei-Jie
School of Software & Microelectronics, Peking University, Beijing 102600, China
Abstract: This paper studies named entity recognition for medical texts generated by online consultations. Using data from an online medical question-answering website, we build datasets with the {B, I, O} annotation scheme and extract four types of medical entities: disease, treatment, examination, and symptom. Taking BiLSTM-CRF as the benchmark model, two deep learning models, IndRNN-CRF and IDCNN-BiLSTM-CRF, are proposed, and their effectiveness is verified on the self-built dataset. Experiments comparing the two new models with the benchmark show that IDCNN-BiLSTM-CRF achieves an F1 value of 0.8165, exceeding BiLSTM-CRF's F1 value of 0.8009, so its overall performance is better. The IndRNN-CRF model attains a high precision of 0.8427, but its recall is lower than that of the benchmark BiLSTM-CRF.
Key words: medical question answering; deep learning; Independent Recurrent Neural Network (IndRNN); dilated convolution; bidirectional RNN

1 Introduction

2 Related Work on Named Entity Recognition

3 Algorithm and Model Design

3.1 BiLSTM-CRF Model

 $P(y|x) = \frac{1}{{Z(x)}}\exp (\sum\nolimits_{k = 1}^K {w_k} f_k(y, x))$ (1)
 $Z(x) = \sum\limits_y {\exp \sum\nolimits_{k = 1}^K {w_kf_k(y, x)} }$ (2)

 $score(X, y) = \sum\nolimits_{i = 0}^n {A_{y_i, y_{i + 1}}} + \sum\nolimits_{i = 1}^n {P_{i, y_i}}$ (3)

$A_{i,j}$ is a transition score matrix whose entries give the score of transferring from tag $i$ to tag $j$. $y_0$ and $y_n$ denote the tags marking the start and end of the sentence, so the transition matrix $A$ is a square matrix of order $k+2$. Equation (4) computes the conditional probability $p(y|X)$ of $y$ given $X$, where $Y_X$ denotes all possible tag sequences for a given sentence $X$. The loss function can be defined as Eq. (5), and training maximizes the log probability of the correct tag sequence.

 $P(y|X) = \frac{{\exp (score(X, y))}}{{\sum\nolimits_{\tilde y \in Y_X} {\exp (score(X, \tilde y ))} }}$ (4)
 $L = \log (P(y|X))$ (5)

 $y^* = \mathop {\arg \max }\limits_{\tilde y \in Y_X} score(X, \tilde y )$ (6)
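As a minimal sketch (not the authors' implementation), Eqs. (3) and (6) can be realized directly in NumPy: `crf_score` sums the transition and emission scores of a tag sequence, and `viterbi_decode` finds the highest-scoring sequence by dynamic programming. The matrix shapes and the placement of the start/end tags at indices $k$ and $k+1$ are assumptions made for illustration.

```python
import numpy as np

def crf_score(A, P, y):
    """Score of a tag sequence y = [start, y_1..y_n, end], as in Eq. (3):
    the sum of transition scores A[y_i, y_{i+1}] and emission scores P[i, y_i].
    A includes the extra start/end tags, so it is a (k+2) x (k+2) matrix."""
    n = P.shape[0]
    s = 0.0
    for i in range(n + 1):          # transitions, incl. start->y_1, y_n->end
        s += A[y[i], y[i + 1]]
    for i in range(1, n + 1):       # emissions for the n real tokens
        s += P[i - 1, y[i]]
    return s

def viterbi_decode(A, P, start, end):
    """Find argmax_y score(X, y), as in Eq. (6), by dynamic programming."""
    n, k = P.shape
    dp = A[start, :k] + P[0]        # best score ending in each tag at step 0
    back = []
    for i in range(1, n):
        # cand[prev, cur]: best score ending at step i with tags (prev, cur)
        cand = dp[:, None] + A[:k, :k] + P[i][None, :]
        back.append(cand.argmax(axis=0))
        dp = cand.max(axis=0)
    dp = dp + A[:k, end]            # close the sequence with the end tag
    best = [int(dp.argmax())]
    for ptrs in reversed(back):     # follow back-pointers to recover the path
        best.append(int(ptrs[best[-1]]))
    return best[::-1]
```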

 Fig. 1 Structure of the BiLSTM-CRF model

3.2 IndRNN-CRF Model

 $h_t = \sigma (Wx_t + Uh_{t - 1} + b)$ (7)

 $h_t = \sigma (Wx_t + U \otimes h_{t - 1} + b)$ (8)

 $h_{n, t} = \sigma (w_nx_t + u_nh_{n, t - 1} + b_n)$ (9)
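A single IndRNN step from Eq. (8) can be sketched as follows. The activation $\sigma$ is taken to be ReLU here, which is the choice commonly paired with IndRNN; the weight shapes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def indrnn_step(x_t, h_prev, W, u, b):
    """One IndRNN step, Eq. (8): h_t = sigma(W x_t + u ⊙ h_{t-1} + b).
    Unlike the vanilla RNN of Eq. (7), the recurrent weight u is a vector,
    so each neuron (Eq. (9)) only sees its own previous state h_{n,t-1}."""
    pre = W @ x_t + u * h_prev + b   # elementwise (Hadamard) recurrence
    return np.maximum(pre, 0.0)      # ReLU activation
```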

 Fig. 2 Structure of 4-IndRNN-CRF

3.3 IDCNN-BiLSTM-CRF Model

 Fig. 3 Schematic diagram of DCNN

Strubell et al.[7] proposed the IDCNN (Iterated Dilated CNN) model, which achieved good results on entity recognition tasks. The dilation width grows exponentially with the number of layers while the number of parameters grows only linearly, so the receptive field quickly covers the entire input. The model stacks four identical dilated convolution blocks, each consisting of three dilated convolution layers with dilation widths 1, 1, and 2. Sentences are fed into the IDCNN model and features are extracted by the convolution layers; the overall framework is the same as BiLSTM-CRF, with the output of the IDCNN model connected through a projection layer to the CRF layer.
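The exponential receptive-field growth described above can be checked with a short calculation. Assuming 1-D convolutions with kernel size 3 (an assumption for illustration), each layer with dilation $d$ widens the receptive field by $(k-1)\cdot d$:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in tokens) of stacked 1-D dilated convolutions.
    Each layer with dilation d and kernel size k adds (k - 1) * d context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# One IDCNN block: three layers with dilation widths 1, 1, 2 (as in the text);
# the model stacks four identical blocks.
print(receptive_field([1, 1, 2] * 4))  # → 33
```

With twelve layers and only three distinct dilation widths, 33 input tokens are visible to each output position, while the parameter count grows linearly in the number of layers.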

 Fig. 4 Structure of the IDCNN-BiLSTM-CRF model

4 Data Processing and Annotation

5 Experimental Results and Analysis

5.1 Experimental Conditions

5.2 Model Parameter Settings

5.3 Experimental Results and Analysis

 $Precision(P)=\frac {\text{number of entities correctly recognized by the system}}{\text{number of entities recognized by the system}}$ (10)
 $Recall(R)=\frac {\text{number of entities correctly recognized by the system}}{\text{number of entities in the documents}}$ (11)
 $F\text{-}measure = \frac{{2 \times P \times R}}{{P + R}}$ (12)
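Equations (10)–(12) amount to the following small helper; the parameter names are hypothetical, chosen only to mirror the three counts in the formulas:

```python
def prf(correct, predicted, gold):
    """Precision, recall, and F-measure from entity counts, per Eqs. (10)-(12).
    correct:   entities correctly recognized by the system
    predicted: entities recognized by the system
    gold:      entities present in the documents"""
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```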

 Fig. 5 Loss-step curve of BiLSTM-CRF

 Fig. 6 Loss-step curve of IndRNN-CRF

 Fig. 7 Loss-step curve of IDCNN-BiLSTM-CRF

6 Conclusion and Future Work

[1] Sundheim BM. Named entity task definition, version 2.1. Proceedings of the Sixth Message Understanding Conference. Columbia, MD, USA. 1995. 319–332.
[2] Huang ZH, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[3] Su Y, Liu J, Huang YL. Research on entity recognition in online medical texts. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 1–9.
[4] Zhang F, Wang M. Medical named entity recognition based on deep learning. Computing Technology and Automation, 2017, 36(1): 123–127. DOI:10.3969/j.issn.1003-6199.2017.01.025
[5] Li S, Li WQ, Cook C, et al. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. arXiv preprint arXiv:1803.04831, 2018.
[6] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2016.
[7] Strubell E, Verga P, Belanger D, et al. Fast and accurate entity recognition with iterated dilated convolutions. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark. 2017. 2670–2680.
[8] Yang JF, Yu QB, Guan Y, et al. A survey of named entity recognition and entity relation extraction in electronic medical records. Acta Automatica Sinica, 2014, 40(8): 1537–1562.
[9] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[10] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, NV, USA. 2013. 3111–3119.
[11] Kenter T, Borisov A, de Rijke M. Siamese CBOW: Optimizing word embeddings for sentence representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany. 2016. 941–951.