Named Entity Recognition of Poetry by Integrating Multi-features in Digital Humanities

doi:10.15888/j.cnki.csa.008986

2023, Vol. 32, No. 3, pp. 300-308

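Authors:
ZHANG Meng
School of Software, North University of China, Taiyuan 030051, China
LIU Zhong-Bao
School of Software, North University of China, Taiyuan 030051, China; Institute of Language Intelligence, Beijing Language and Culture University, Beijing 100083, China

Fund project: Late-Stage Project of Philosophy and Social Sciences Research, Ministry of Education of China (21JHQ081)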



Abstract:

Digital humanities has attracted wide attention in recent years, and research on named entity recognition (NER) for ci poetry is emerging. Few studies, however, have examined the representational power of character features, the accuracy of word segmentation, or the effectiveness of domain knowledge in poetry texts. Motivated by the pictographic origins of Chinese characters and the particular form of ci texts, this paper augments character features with radical, metrical, and phonetic (tone and rhyme) features, and proposes a feature enhancement unit and a feature extraction unit. Knowledge triples about tune pattern titles (cipai) are embedded with the ANALOGY model to obtain tune-pattern knowledge vectors. The radical, character, metrical, phonetic, and tune-pattern knowledge vectors are then deeply fused through a bidirectional long short-term memory (BiLSTM) network and an attention mechanism, yielding a multi-feature NER method for ci poetry. Comparative and ablation experiments on a self-built corpus drawn from Translation of Among Flowers (Hua Jian Ji) (《花间集全译》) show that the proposed method makes effective use of the multiple features to improve recognition performance, reaching an F1 score of 85.63%.

Keywords: named entity recognition; multi-features; metrical rule; digital humanities; poetry
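The abstract states that five per-character vectors are fused through a BiLSTM and an attention mechanism but does not give the fusion equations. The following is only a minimal sketch of one generic option, dot-product attention pooling over equally sized feature vectors; it omits the BiLSTM entirely and is not the authors' architecture. The function names `softmax` and `attention_fuse`, the `query` vector, and all numeric values are hypothetical and chosen purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse same-length feature vectors into one vector.

    features: dict mapping a feature name to a list of floats
    query:    list of floats of the same length (assumed to be learned
              in a real model; fixed here for illustration)
    Returns the fused vector and the attention weight per feature.
    """
    names = list(features)
    # One relevance score per feature: dot product with the query.
    scores = [sum(q * x for q, x in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the features.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Toy vectors for a single character (values are made up).
features = {
    "character": [0.2, 0.1, 0.4],
    "radical": [0.0, 0.3, 0.1],
    "metrical": [0.5, 0.0, 0.2],
    "phonetic": [0.1, 0.1, 0.1],
    "tune_knowledge": [0.3, 0.2, 0.0],
}
query = [1.0, 0.5, 0.25]
fused, weights = attention_fuse(features, query)
```

In this sketch the weights form a probability distribution over the five features, so the fused vector stays inside the range spanned by its inputs. A real model would typically learn the query and feed contextual BiLSTM outputs rather than raw embeddings.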


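Cite this article

ZHANG Meng, LIU Zhong-Bao. Named Entity Recognition of Poetry by Integrating Multi-features in Digital Humanities. Computer Systems & Applications (计算机系统应用), 2023, 32(3): 300-308.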


History

Received: 2022-08-17
Last revised: 2022-09-15
Published online: 2022-12-02
