Semantic Relationship Recognition of Oil Documents Based on Improved Word Vector

Fund Project: Special Project on Innovation Methods of the Ministry of Science and Technology (2015IM01030)

    Abstract:

    Semantic relationship recognition is the process of identifying the semantic relations contained in documents, and it is an important part of ontology construction. When building an ontology for the petroleum domain, recognition is harder because petroleum documents contain many compound terms. Current recognition algorithms are mainly based on association rules and are not tailored to a specific domain. By analyzing the characteristics of petroleum documents, this study proposes a semantic relationship recognition algorithm for petroleum documents based on improved word vectors. Building on the Continuous Bag-Of-Words (CBOW) model, it extends training to cover petroleum terminology and introduces negative sampling and subsampling to improve training accuracy and efficiency. The resulting vector features are then used to train a Support Vector Machine (SVM) classifier for semantic relationship recognition. Experimental results show that word vectors trained in this way can accurately identify semantic relations in the petroleum domain, where the method has a clear advantage.
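
The abstract names CBOW, negative sampling, and subsampling without showing how they fit together. The following is a minimal pure-Python sketch of standard CBOW training with those two techniques, not the authors' implementation. The toy corpus, the function names (`keep_probability`, `train_cbow`), and all hyperparameter values are assumptions made for illustration; the paper's petroleum-term extension and the SVM classification step are not shown.

```python
import math
import random
from collections import Counter


def keep_probability(freq, t=1e-3):
    """Subsampling: chance of keeping a word with relative frequency `freq`.
    Uses the common word2vec rule keep = sqrt(t / freq), capped at 1."""
    return min(1.0, math.sqrt(t / freq))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_cbow(sentences, dim=16, window=2, negatives=3,
               epochs=20, lr=0.05, t=1e-3, seed=0):
    """Train CBOW input vectors with negative sampling and subsampling."""
    rng = random.Random(seed)
    counts = Counter(w for s in sentences for w in s)
    vocab = sorted(counts)
    index = {w: i for i, w in enumerate(vocab)}
    total = sum(counts.values())
    keep = [keep_probability(counts[w] / total, t) for w in vocab]
    # Noise distribution for negative sampling: unigram counts raised to 0.75.
    noise_weights = [counts[w] ** 0.75 for w in vocab]
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]

    for _ in range(epochs):
        for sentence in sentences:
            # Subsampling: randomly drop frequent words before windowing.
            ids = [index[w] for w in sentence if rng.random() < keep[index[w]]]
            for pos, target in enumerate(ids):
                context = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not context:
                    continue
                # CBOW hidden layer: average of the context word vectors.
                hidden = [sum(w_in[c][d] for c in context) / len(context)
                          for d in range(dim)]
                noise = rng.choices(range(len(vocab)), weights=noise_weights, k=negatives)
                samples = [(target, 1.0)] + [(n, 0.0) for n in noise if n != target]
                error = [0.0] * dim
                for word, label in samples:
                    score = sigmoid(sum(h * o for h, o in zip(hidden, w_out[word])))
                    grad = lr * (label - score)
                    for d in range(dim):
                        error[d] += grad * w_out[word][d]
                        w_out[word][d] += grad * hidden[d]
                # Propagate the accumulated error back to every context word.
                for c in context:
                    for d in range(dim):
                        w_in[c][d] += error[d]
    return {w: w_in[index[w]] for w in vocab}


# Hypothetical toy corpus; a real run would use tokenized petroleum documents.
corpus = [
    ["drilling", "fluid", "density", "affects", "wellbore", "stability"],
    ["reservoir", "pressure", "affects", "oil", "recovery"],
    ["drilling", "fluid", "viscosity", "affects", "cuttings", "transport"],
]
# t is raised far above the usual 1e-3 because every word is "frequent" in a tiny corpus.
vectors = train_cbow(corpus, dim=8, epochs=5, t=0.05)
```

In a pipeline like the one the abstract describes, vectors such as these would be combined into features for candidate term pairs and passed to an SVM classifier; how the paper builds those features is not stated here.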

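Cite this article

Gong Faming (宫法明), Zhu Penghai (朱朋海). Semantic Relationship Recognition of Oil Documents Based on Improved Word Vector. Computer Systems & Applications, 2018, 27(8): 153-158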

History
  • Received: 2017-12-10
  • Last revised: 2018-01-04
  • Published online: 2018-08-04