High Performance Chinese Lexicon Technology Based on Character Tree Structure

doi:10.15888/j.cnki.csa.007052

2019, Vol. 28, No. 8, pp. 262-267

Authors:

YANG Guang-Bao, Qingtian College, Zhejiang Radio & TV University, Qingtian 323900, China
YANG Feng-He, School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
ZHENG Hui-Jin, Zhejiang Qingtian Vocational and Technical School, Qingtian 323900, China

Abstract:

Massive Chinese information processing is a branch of big data processing, and applying big data technology to Chinese text necessarily relies on Chinese word segmentation, which makes segmentation a foundational technology for big-data Chinese information processing. Since the turn of the century, Chinese word segmentation has advanced along two directions: performance and accuracy. On the performance side, work has focused on improving the segmentation scanning algorithm and on improving lexicon storage and query methods. On the accuracy side, work has focused on better identification and handling of unregistered (out-of-vocabulary) words and ambiguous words. This paper abandons the idea of querying the lexicon through an index and proposes a lexicon storage structure based on a character tree. Its segmentation speed is 35 times that of the ordinary binary search method, and it occupies only 1/5 of the memory. This represents a substantial step forward in performance for big data technology applied to Chinese information processing.

Key words: character tree; Chinese word segmentation; hashing; binary search; time complexity
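
To make the idea of a character-tree lexicon concrete, the following is a minimal sketch of a generic trie with forward maximum matching. It is not the authors' implementation: the abstract does not describe their node layout, and the names `CharTreeNode`, `CharTree`, `longest_match`, and `segment`, the dictionary-based child storage, and the sample lexicon are all assumptions made for illustration. The point it shows is that a lookup walks one edge per character, so its cost depends on the word length rather than on the lexicon size, whereas a binary search over a sorted word list needs a number of string comparisons that grows with the logarithm of the lexicon size.

```python
from typing import Dict, List


class CharTreeNode:
    """One node of the character tree; each edge is labeled with one character."""

    def __init__(self) -> None:
        self.children: Dict[str, "CharTreeNode"] = {}
        self.is_word = False  # True if the path from the root spells a lexicon word


class CharTree:
    """Hypothetical trie-based lexicon (illustrative, not the paper's structure)."""

    def __init__(self, words: List[str]) -> None:
        self.root = CharTreeNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # Create the child node on first use, then descend into it.
            node = node.children.setdefault(ch, CharTreeNode())
        node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Return the length of the longest lexicon word starting at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no lexicon word continues with this character
            if node.is_word:
                best = i - start + 1
        return best


def segment(tree: CharTree, text: str) -> List[str]:
    """Forward maximum matching; unmatched characters become one-character tokens."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        length = tree.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


if __name__ == "__main__":
    lexicon = ["中文", "分词", "中文分词", "技术", "大数据"]  # sample words, assumed
    print(segment(CharTree(lexicon), "大数据中文分词技术"))
    # ['大数据', '中文分词', '技术']
```

In this sketch each node stores its children in a Python dictionary, which is convenient but not memory-efficient; the memory figure reported in the abstract presumably depends on a more compact layout that the abstract does not specify.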

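Cite this article:

YANG Guang-Bao, YANG Feng-He, ZHENG Hui-Jin. High Performance Chinese Lexicon Technology Based on Character Tree Structure. Computer Systems & Applications (计算机系统应用), 2019, 28(8): 262-267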

History

Received: 2019-02-22
Last revised: 2019-03-22
Published online: 2019-08-14
Published: 2019-08-15
