Computer Systems Applications, 2022, 31(2): 291-297
LTM-B Model of Long Text Matching
(School of Computer Science & School of Cyberspace Science, Xiangtan University, Xiangtan 411105, China)
Received: April 25, 2021    Revised: May 19, 2021
Abstract: Long text matching is a fundamental task in natural language processing and plays a key role in applications such as text clustering and news recommendation. Progress has been slow because of limitations in the available corpora, in document length and structure, and in text representation techniques. The bidirectional encoder representations from Transformers (BERT) model, proposed in recent years, performs very well at text representation. BERT handles long texts in three common ways: truncation, segmentation, and compression. Truncation discards a large amount of text; segmentation retains the text but loses part of the semantic information; compression may lose some key information. To address these problems, this study improves the segmentation method and proposes a BERT-based long text matching model (LTM-B). Built on a Siamese network, the model takes a hierarchical approach: it splits each document into multiple segments and uses BERT to vectorize them, which yields a matrix representation of the document. A bidirectional long short-term memory (BiLSTM) network generates a position matrix, and the sum of the document matrix and its position matrix is fed to a Transformer encoder for feature extraction. Finally, the two document matrices are interacted, pooled, and concatenated, and a fully connected layer classifies the result to output the matching decision. Experiments show that LTM-B outperforms the compared methods on long text matching.
Keywords: long text matching; BERT; Siamese network; BiLSTM; Transformer
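The segmentation step mentioned in the abstract can be illustrated with a minimal sketch. It covers only the splitting of a long document into BERT-sized pieces, not the BiLSTM, Transformer encoder, or matching layers. The abstract does not state the segmentation rule, so fixed-length, non-overlapping segments are an assumption here, as are the helper names `split_into_segments` and `add_special_tokens` and the 510-token segment length (standard BERT's 512-token input limit minus the `[CLS]` and `[SEP]` tokens).

```python
from typing import List

# Assumption: standard BERT accepts at most 512 tokens per input, two of
# which are the special [CLS] and [SEP] tokens, leaving 510 for content.
MAX_CONTENT_TOKENS = 510


def split_into_segments(
    tokens: List[str], segment_len: int = MAX_CONTENT_TOKENS
) -> List[List[str]]:
    """Cut a token list into consecutive, non-overlapping segments.

    Hypothetical helper: the paper's actual segmentation rule may differ.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def add_special_tokens(segment: List[str]) -> List[str]:
    """Wrap one segment in the [CLS] ... [SEP] markers BERT expects."""
    return ["[CLS]"] + segment + ["[SEP]"]


if __name__ == "__main__":
    # A toy "document" of 1200 tokens, longer than one BERT input.
    document = [f"tok{i}" for i in range(1200)]
    segments = [add_special_tokens(s) for s in split_into_segments(document)]
    print(len(segments), [len(s) for s in segments])  # 3 [512, 512, 182]
```

In a model of this kind, each segment would be encoded by BERT into one vector, and stacking those vectors gives the document matrix that the abstract describes; that encoding step is omitted here.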
Funding: Hunan Provincial Key Research and Development Program (2022SK2106)
Cite this article:
LIU Long, LIU Xin, CAI Lin-Jie, TANG Chao. LTM-B Model of Long Text Matching. COMPUTER SYSTEMS APPLICATIONS, 2022, 31(2): 291-297