High Performance Chinese Lexicon Technology Based on Character Tree Structure

doi:10.15888/j.cnki.csa.007052


Volume 28, Issue 8, 2019, pp. 262-267


Author:
YANG Guang-Bao, Qingtian College, Zhejiang Radio & TV University, Qingtian 323900, China
YANG Feng-He, School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
ZHENG Hui-Jin, Zhejiang Qingtian Vocational and Technical School, Qingtian 323900, China

                    


Abstract:

Processing massive volumes of Chinese text is a branch of big data processing, and any big data approach to Chinese text depends on Chinese word segmentation, which makes segmentation a foundational technology for big data Chinese information processing. Since the start of this century, Chinese word segmentation has advanced steadily in both performance and accuracy. Performance gains have come mainly from improvements to the segmentation scanning algorithm, to lexicon storage, and to the lexicon query method. Accuracy gains have come mainly from better handling of unregistered (out-of-vocabulary) words and ambiguous words. This paper abandons lookup through a lexicon index and proposes a lexicon storage structure based on a character tree. Its segmentation speed is 35 times faster than the conventional binary search (half-interval) method, while it occupies only 1/5 of that method's memory. This is a substantial step forward in the performance of big data technology for Chinese information processing.

Key words: character tree; Chinese word segmentation; hash; binary search; time complexity
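
The abstract does not spell out the layout of the character tree, so the following is only a minimal sketch of the general idea, assuming a plain trie keyed by single characters and a forward maximum matching scan. The names CharTreeNode, CharTreeLexicon, longest_match and segment are hypothetical and are not taken from the paper; the paper's actual structure and its measured speed and memory figures may rest on details not shown here.

```python
class CharTreeNode:
    """One node of the character tree (hypothetical layout)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a single character to the next node
        self.is_word = False  # True if the path from the root spells a word


class CharTreeLexicon:
    """Lexicon stored as a tree of characters instead of a sorted word list."""

    def __init__(self, words=()):
        self.root = CharTreeNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, CharTreeNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Length of the longest lexicon word beginning at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def segment(text, lexicon):
    """Forward maximum matching; characters with no match become single tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        length = lexicon.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


lexicon = CharTreeLexicon(["中文", "分词", "中文分词", "技术"])
print(segment("中文分词技术", lexicon))  # ['中文分词', '技术']
```

In this sketch a single walk down the tree finds the longest word at a position, touching at most one node per character of that word. A binary search over a sorted word list would instead need a separate search for each candidate length, which is one plausible reason a tree layout can be faster; the sketch itself makes no claim about the 35x or 1/5 figures.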

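Citation:

YANG Guang-Bao, YANG Feng-He, ZHENG Hui-Jin. High Performance Chinese Lexicon Technology Based on Character Tree Structure. Computer Systems & Applications (计算机系统应用), 2019, 28(8): 262-267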


History

Received: February 22, 2019
Revised: March 22, 2019
Online: August 14, 2019
Published: August 15, 2019

Copyright: Institute of Software, Chinese Academy of Sciences