Text Similarity Method Based on the Improved Jaccard Coefficient

doi:10.15888/j.cnki.csa.006123


2017, Vol. 26, No. 12, pp. 137-142. DOI:10.15888/j.cnki.csa.006123

Authors: YU Ting-Ting, XU Peng-Na, JIANG Yu-E, LIN Jie
Affiliation: Faculty of Software, Fujian Normal University, Fuzhou 350108, China
Funding: National Natural Science Foundation of China (61472082); Natural Science Foundation of Fujian Province (2014J01220)


Abstract:

Text similarity is mainly applied in fields such as duplicate checking of academic papers and deduplication in search engines. In traditional text similarity methods, however, the feature extraction and word segmentation steps are overly cumbersome, and selecting elements at random introduces uncertainty into the weights. To address these shortcomings, a method for determining document similarity based on an improved Jaccard coefficient is proposed. The algorithm takes into account the weights of elements and samples within a document, as well as their contribution to the similarity among multiple documents. Experimental results show that the method is effective and achieves high accuracy, that it is applicable to Chinese and English documents of various lengths, and that it addresses the imprecise computation of inter-document similarity in existing techniques.

Key words: text similarity; Jaccard coefficient; text analysis; text duplicate checking; text retrieval
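
For orientation, the sketch below contrasts the classic Jaccard coefficient on token sets with a weighted variant. The improved formula proposed in the paper is not given on this page, so the weighted version shown here is the standard generalized Jaccard (sum of minimum weights over sum of maximum weights) with raw term counts as weights. It is an assumption chosen for illustration, not the authors' method, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def jaccard(a_tokens, b_tokens):
    """Classic Jaccard coefficient |A ∩ B| / |A ∪ B| on token sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0  # convention chosen here: two empty documents count as identical
    return len(a & b) / len(a | b)


def weighted_jaccard(a_tokens, b_tokens):
    """Generalized Jaccard with term counts as weights: sum(min) / sum(max).

    Assumption: weights are raw term frequencies, not the paper's weighting.
    """
    a, b = Counter(a_tokens), Counter(b_tokens)
    terms = set(a) | set(b)
    if not terms:
        return 1.0
    # Counter returns 0 for a term that is absent from one document.
    numerator = sum(min(a[t], b[t]) for t in terms)
    denominator = sum(max(a[t], b[t]) for t in terms)
    return numerator / denominator


# Hypothetical sample documents, split on whitespace.
doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()

print(jaccard(doc1, doc2))           # 3 shared terms out of 7 distinct terms
print(weighted_jaccard(doc1, doc2))  # repeated "the" now contributes weight 2
```

Whitespace splitting only suits space-delimited text such as English. Chinese documents would first need word segmentation or character n-grams, which this sketch does not attempt.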

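Cite this article:

YU Ting-Ting, XU Peng-Na, JIANG Yu-E, LIN Jie. Text Similarity Method Based on the Improved Jaccard Coefficient. Computer Systems & Applications (计算机系统应用), 2017, 26(12): 137-142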


History

Received: 2017-03-21
Last revised: 2017-04-13
Published online: 2017-12-07
