Received: September 28, 2020 Revised: October 28, 2020
Abstract: Spanish is the world's second most widely spoken native language after Chinese and one of the six official languages of the United Nations. Because of its complex morphology and grammatical rules, classic term extraction methods such as C-value cannot be relied on to perform well on Spanish, which in turn degrades Spanish text mining. This study therefore proposes a Spanish term extraction method that builds a complete lexicon for the structured modeling of Spanish text. Given a Spanish text or corpus, the method extracts a set of terms in three steps: text preprocessing, candidate term extraction, and DC-value-based termhood calculation. The candidate term set produced by the first two steps can be used directly as a lexicon for text mining. The termhood scores produced by the third step indicate how likely each candidate is to be a genuine term, which reduces the manual effort of judging candidates. Experimental results show that the term set extracted automatically by the proposed method reaches an accuracy of 80% and a recall far higher than that of classic methods, so the method can provide an effective lexicon for Spanish text mining.
Keywords: Spanish text; text mining; term extraction; DC-value
Foundation item: National Natural Science Foundation of China (71771054)
Cite this article:
于娟,颜煜铃,简梓炜,张晨.基于DC-Value的西班牙语文本词语提取方法.计算机系统应用,2021,30(6):271-277
YU Juan, YAN Yu-Ling, JIAN Zi-Wei, ZHANG Chen. Extracting Terms from Spanish Corpora Based on DC-Value. COMPUTER SYSTEMS APPLICATIONS, 2021, 30(6): 271-277