Semantic Enhanced Multi-strategy Policy Term Extraction System
Authors: 曹秀娟, 马志柔, 朱涛, 张庆文, 杨燕, 叶丹
Funding: National Natural Science Foundation of China (61802381)

Abstract:

Policy terms are time-sensitive, low-frequency, sparse, and often take the form of compound phrases, so traditional term extraction methods struggle to meet practical needs. To address this problem, we design and implement a semantic enhanced multi-strategy policy term extraction system. The system models the features of policy texts along two dimensions: frequent item mining and semantic similarity. It selects feature seed words by integrating multiple frequent pattern mining strategies, and it uses a pre-trained language model to enhance semantic matching so that low-frequency, sparse policy terms can be recalled. In this way, the system achieves semi-automatic policy term extraction, moving from a cold start without a thesaurus to a hot start with a thesaurus. The system can improve policy text analysis and provides technical support for building smart government service platforms.
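
The two dimensions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that token n-gram counting stands in for the frequent pattern mining strategies, and that a toy bag-of-tokens encoder (`toy_embed`) stands in for the pre-trained language model. All function names, thresholds, and sample texts are hypothetical.

```python
from collections import Counter
from math import sqrt


def mine_frequent_ngrams(docs, n_min=2, n_max=3, min_count=2):
    """Count token n-grams over all documents; keep those seen at least min_count times."""
    counts = Counter()
    for tokens in docs:
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def toy_embed(phrase):
    """Hypothetical stand-in encoder: a bag of tokens.
    A real system would return a pre-trained language model embedding here."""
    return Counter(phrase)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_by_similarity(candidates, seeds, embed=toy_embed, threshold=0.6):
    """Recall candidates whose best similarity to any seed term reaches the threshold."""
    seed_vecs = [embed(s) for s in seeds]
    recalled = []
    for cand in candidates:
        if cand in seeds:
            continue  # already in the seed set
        best = max((cosine(embed(cand), sv) for sv in seed_vecs), default=0.0)
        if best >= threshold:
            recalled.append((cand, best))
    return recalled


# Hypothetical pre-tokenized policy snippets (cold start: no thesaurus yet).
docs = [
    ["tax", "relief", "policy", "for", "small", "business"],
    ["tax", "relief", "policy", "extended"],
    ["new", "tax", "relief", "scheme"],
]

# Step 1: frequent n-grams become seed terms.
seeds = mine_frequent_ngrams(docs)

# Step 2: low-frequency candidates are recalled by similarity to the seeds.
candidates = [("tax", "relief", "scheme"), ("small", "business")]
recalled = recall_by_similarity(candidates, seeds)

# Step 3: seeds plus recalled terms form the thesaurus for the next (hot start) round.
thesaurus = set(seeds) | {term for term, _ in recalled}
print(sorted(thesaurus))
```

In this sketch, `("tax", "relief", "scheme")` occurs only once, so frequency alone misses it, yet it is recalled because it overlaps strongly with the frequent seed `("tax", "relief")`. Swapping `toy_embed` for a real encoder would let the same loop match terms that are semantically related without sharing tokens.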

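Cite this article

曹秀娟, 马志柔, 朱涛, 张庆文, 杨燕, 叶丹. Semantic Enhanced Multi-strategy Policy Term Extraction System. Computer Systems & Applications (计算机系统应用), 2022, 31(9): 152-158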

History
  • Received: 2021-12-21
  • Last revised: 2022-01-24
  • Published online: 2022-06-16