Named Entity Recognition of Poetry by Integrating Multi-features in Digital Humanities
Authors: Zhang Meng, Liu Zhongbao
Funding: Late-stage Funding Project for Philosophy and Social Sciences Research, Ministry of Education (21JHQ081)
    Abstract:

    In recent years, digital humanities has attracted wide attention, and research on named entity recognition in ci poetry (词) has been emerging within it. However, few studies have examined the representational power of character features, the accuracy of word segmentation, or the effectiveness of domain knowledge. To address this, and in view of the pictographic origins of Chinese characters and the particularities of ci texts, this paper adds radical, metrical, and sound (tone and rhyme) features to the character features and proposes a feature enhancement unit and a feature extraction unit. Knowledge triples about tune pattern titles (词牌) are embedded with the ANALOGY model to obtain tune-title knowledge vectors. The radical, character, metrical, sound, and tune-title knowledge vectors are then deeply fused through a bidirectional long short-term memory (BiLSTM) network, an attention mechanism, and related models, yielding a multi-feature named entity recognition method for ci poetry. Comparative and ablation experiments on a self-built corpus drawn from Hua Jian Ji Quan Yi (《花间集全译》, a complete translation of Among the Flowers) show that the proposed method makes effective use of the multiple features to improve recognition performance, reaching an F1 score of 85.63%.
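The fusion step described in the abstract can be pictured with a minimal sketch. It is not the paper's architecture, which the abstract does not specify in detail: it only shows one common way to combine several per-character feature vectors with attention weights. The function names, the 3-dimensional toy vectors, and the fixed `query` vector are all hypothetical; in a trained model the query and the vectors would be learned, and a BiLSTM would typically process the sequence before or after this step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse equal-length feature vectors by dot-product attention.

    features: dict mapping a feature name to its vector (hypothetical layout).
    query:    vector of the same length used to score each feature.
    Returns (weights per feature name, fused vector).
    """
    names = list(features)
    # Score each feature vector by its dot product with the query.
    scores = [sum(q * x for q, x in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    dim = len(query)
    # The fused vector is the attention-weighted sum of the feature vectors.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused

# Toy vectors for a single character (made-up values, for illustration only).
features = {
    "character": [0.2, 0.1, 0.4],
    "radical": [0.0, 0.3, 0.1],
    "metrical": [0.5, 0.0, 0.0],
    "sound": [0.1, 0.1, 0.1],
    "tune_knowledge": [0.3, 0.2, 0.2],
}
query = [1.0, 0.0, 0.0]  # assumed fixed here; normally a learned parameter

weights, fused = attention_fuse(features, query)
```

Here `weights` sums to 1 across the five feature types, so `fused` stays in the same space as its inputs. Which feature dominates depends entirely on the query.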

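Cite this article

Zhang M, Liu ZB. Named entity recognition of poetry by integrating multi-features in digital humanities. Computer Systems & Applications, 2023, 32(3): 300-308.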

History
  • Received: 2022-08-17
  • Last revised: 2022-09-15
  • Published online: 2022-12-02