Similar Text Matching Based on Siamese Network and Char-word Vector Combination
Authors: 李奕霖, 周艳平
    Abstract:

    Text similarity matching underlies many natural language processing tasks. This paper proposes a text similarity matching method that combines a Siamese network with character-level and word-level vectors. Following the Siamese network idea, the method models each text as a whole and judges whether two texts are similar. First, during feature extraction, BERT and WoBERT are used to extract character-level and word-level sentence vectors, respectively, and the two are combined so that the resulting sentence vector carries richer semantic information. Second, to address the excessive dimensionality that arises when the features are fused, principal component analysis (PCA) is applied to the high-dimensional vectors, removing redundant information and noise. Finally, a Softmax classifier produces the similarity matching result. Experiments on the LCQMC dataset show that the model reaches an accuracy of 89.92% and an F1 score of 88.52%, indicating that it extracts text semantic information more effectively and is better suited to text similarity matching tasks.
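
The pipeline described in the abstract (fuse character-level and word-level sentence vectors, reduce them with PCA, classify with Softmax) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random arrays stand in for BERT and WoBERT outputs, the pair feature `[u, v, |u - v|]` is an assumption borrowed from Sentence-BERT, the sizes (`dim = 768`, `k = 16`) are illustrative, and the classifier weights are untrained. The helper names `fuse`, `fit_pca`, and `apply_pca` are hypothetical.

```python
import numpy as np


def fuse(char_vecs, word_vecs):
    """Concatenate character-level and word-level sentence vectors."""
    return np.concatenate([char_vecs, word_vecs], axis=1)


def fit_pca(x, k):
    """Return the mean of x and its top-k principal directions."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]


def apply_pca(x, mean, components):
    """Project x onto the fitted principal directions."""
    return (x - mean) @ components.T


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_pairs, dim, k = 32, 768, 16  # illustrative sizes, not from the paper

# Stand-ins for encoder outputs. In the paper these come from BERT
# (character level) and WoBERT (word level); here they are random.
a_char = rng.normal(size=(n_pairs, dim))
a_word = rng.normal(size=(n_pairs, dim))
b_char = rng.normal(size=(n_pairs, dim))
b_word = rng.normal(size=(n_pairs, dim))

a = fuse(a_char, a_word)  # first sentence of each pair, shape (32, 1536)
b = fuse(b_char, b_word)  # second sentence of each pair

# Fit a single PCA on both sides so the two branches share one projection,
# mirroring the shared weights of a Siamese network.
mean, comps = fit_pca(np.vstack([a, b]), k)
u = apply_pca(a, mean, comps)
v = apply_pca(b, mean, comps)

# Pair features [u, v, |u - v|]: an assumption, not confirmed by the abstract.
features = np.concatenate([u, v, np.abs(u - v)], axis=1)

# Untrained two-class Softmax layer (similar / not similar).
w = rng.normal(scale=0.01, size=(3 * k, 2))
bias = np.zeros(2)
probs = softmax(features @ w + bias)
pred = probs.argmax(axis=1)
```

In a real setup the weights `w` and `bias` would be learned on labelled sentence pairs such as those in LCQMC, and the PCA would be fitted on training vectors only and then reused unchanged at test time.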

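Cite this article

李奕霖, 周艳平. Similar Text Matching Based on Siamese Network and Char-word Vector Combination. 计算机系统应用 (Computer Systems & Applications), 2022, 31(10): 295-302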

History
  • Received: 2022-01-21
  • Last revised: 2022-02-22
  • Published online: 2022-06-24