Geological Text Topic Model Based on Data Augmentation

doi:10.15888/j.cnki.csa.008563

2022, Vol. 31, No. 7, pp. 290-297

Authors:

ZHANG Jing-Yuan
School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074, China
LIU Gang
School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074, China; Hubei Key Laboratory of Intelligent Geo-Information Processing, Wuhan 430074, China
ZENG Yue
School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074, China
ZHOU Da-Shuang
School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074, China
CHEN Qi-Yu
School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074, China; Hubei Key Laboratory of Intelligent Geo-Information Processing, Wuhan 430074, China

Funding: Joint Key Program of the National Natural Science Foundation of China (U1711267); Cooperative Project of the Ministry of Water Resources (2019306340); National Innovation Training Program of China University of Geosciences (Wuhan) (201810491232)

Abstract:

Applying a topic model directly to geological texts for clustering yields low topic accuracy and poor continuity among topic keywords. This study proposes improvements to address both problems. In the word segmentation stage, a repeated word string extraction algorithm based on word frequency statistics retains geological terms so that text topics can be extracted accurately; it also reduces the number of redundant word strings, which saves memory and makes retained-word extraction more efficient. In addition, a text data augmentation algorithm based on term frequency-inverse document frequency (TF-IDF) and word vectors processes the original segmented corpus to strengthen its topic features. The algorithm is then combined with a topic model to extract topics from the processed corpus. Because the model's prior information is enhanced, its performance improves. Experimental results show that combining the proposed algorithm with the latent Dirichlet allocation (LDA) model performs well, outperforming the other methods on the relevant metrics and in its output.

Keywords: geological text; topic model; data augmentation; word vector; term frequency-inverse document frequency (TF-IDF)
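
The abstract does not spell out how the TF-IDF and word-vector augmentation works, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that the highest-scoring TF-IDF words of each segmented document are treated as its topic words, and that each topic word's nearest neighbour in the embedding space is appended to the document to reinforce the topic signal. The function names (`tf_idf`, `cosine`, `augment`), the parameters `top_k` and `min_sim`, the plain `log(N / df)` weighting, and the toy three-dimensional vectors are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per tokenized document.

    Uses tf = count / doc length and idf = log(N / df), so a word that
    appears in every document scores 0. This weighting is an assumption.
    """
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment(docs, vectors, top_k=1, min_sim=0.8):
    """Append the nearest embedding neighbour of each document's top_k
    TF-IDF words, provided its similarity is at least min_sim.
    Returns new token lists; the input documents are not modified.
    """
    augmented = []
    for doc, scores in zip(docs, tf_idf(docs)):
        keywords = sorted(scores, key=scores.get, reverse=True)[:top_k]
        extra = []
        for keyword in keywords:
            if keyword not in vectors:
                continue  # no embedding available for this word
            best, best_sim = None, min_sim
            for candidate, vec in vectors.items():
                if candidate == keyword:
                    continue
                sim = cosine(vectors[keyword], vec)
                if sim >= best_sim:
                    best, best_sim = candidate, sim
            if best is not None:
                extra.append(best)
        augmented.append(list(doc) + extra)
    return augmented


# Toy segmented corpus and hand-made embeddings (hypothetical data).
docs = [
    ["granite", "intrusion", "fault", "granite"],
    ["limestone", "fossil", "fault"],
    ["aquifer", "limestone", "aquifer", "groundwater"],
]
vectors = {
    "granite": [1.0, 0.0, 0.0],
    "diorite": [0.9, 0.1, 0.0],
    "fossil": [0.0, 1.0, 0.0],
    "ammonite": [0.1, 0.9, 0.0],
    "aquifer": [0.0, 0.0, 1.0],
    "groundwater": [0.0, 0.2, 0.9],
}

augmented = augment(docs, vectors)
for tokens in augmented:
    print(tokens)
```

In a real pipeline the toy vectors would presumably be replaced by embeddings trained on the geological corpus, and the augmented token lists would then be passed to the topic model (LDA in the paper's best-performing configuration).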

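Cite this article:

ZHANG Jing-Yuan, LIU Gang, ZENG Yue, ZHOU Da-Shuang, CHEN Qi-Yu. Geological Text Topic Model Based on Data Augmentation. Computer Systems & Applications, 2022, 31(7): 290-297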

History:

Received: 2021-10-07
Revised: 2021-11-08
Published online: 2022-05-31
