###

计算机系统应用英文版:2018,27(1):78-85

View/Add Comment 过刊浏览高级检索 HTML

←前一篇 | 后一篇→

码上扫一扫！

下载全文

基于序列前缀技术的XML频繁路径挖掘算法

张洁, 毛国君

(中央财经大学信息学院, 北京 100081)

Prefix-Based XML Frequent Path Mining Algorithm

ZHANG Jie, MAO Guo-Jun

(School of Information, Central University of Finance and Economics, Beijing 100081, China)

摘要

图/表

参考文献

相似文献

本文已被：浏览 1648次下载 1637次
Received:April 09, 2017 Revised:May 09, 2017

中文摘要: XML文档是半结构化数据，对其进行频繁路径挖掘可以分为两步：XML文档序列化和序列挖掘阶段. 现有的序列化方式将XML文档表示为Xpath路径集合，其中有大量的节点冗余；序列挖掘阶段采用的类Apriori算法需要多次扫描数据库并产生大量的候选集，采用的PrefixSpan算法会产生大量的投影数据库，占用较大的内存. 针对以往XML频繁路径挖掘算法存在的不足，本文提出一种高效的挖掘算法——基于序列前缀技术的XML频繁路径挖掘算法（PXFP，Prefix-based XML Frequent Path Mining Algorithm）. PXFP算法以广度优先方式遍历XML文档树并将每个节点表示为“节点：父节点”的形式，这种序列化的方式减少了节点冗余. 在序列挖掘阶段借鉴PrefixSpan 算法中前缀的概念，但不产生投影数据库，仅得到直接后缀（即前缀的子节点），通过记录频繁子路径的位置信息逐渐扩大频繁模式的长度，位置信息的引入减少了对数据库的扫描. 实验结果表明，PXFP算法取得了比PrefixSpan算法更高的时间和空间效率.

中文关键词: XML频繁路径挖掘序列化位置信息前缀

Abstract:XML documents are semi-structured data, and XML frequent path mining can be divided into two steps: XML document serialization and sequence mining. The existing serialization method expresses the XML document as a set of Xpath paths with a plenty of node redundancy. Algorithms based on Apriori require multiple scanning of the database and can generate a large number of candidate sets. The PrefixSpan algorithm generates a large number of projection databases, occupying a lot of memory space. In view of the shortcomings of the existing algorithms used in XML frequent path mining, this paper proposes an efficient mining algorithm called Prefix-based XML Frequent Path Mining Algorithm (PXFP). The PXFP algorithm traverses the XML document tree in a breadth-first manner and represents each node as “node: parent node”, which reduces the node redundancy. The PXFP does not generate the projection database, but only gets the sub-node of the prefix, and then increases the length of the frequent pattern by the position information of the frequent sub-path, which reduces scanning the database. The experimental results show that the PXFP algorithm achieves higher time and space efficiency than the PrefixSpan algorithm.

keywords: XML frequent path mining serialization location information prefix

文章编号： 中图分类号： 文献标志码：

基金项目:国家自然科学基金（61273293）

Author Name	Affiliation
ZHANG Jie	School of Information, Central University of Finance and Economics, Beijing 100081, China
MAO Guo-Jun	School of Information, Central University of Finance and Economics, Beijing 100081, China

Author Name	Affiliation
ZHANG Jie	School of Information, Central University of Finance and Economics, Beijing 100081, China
MAO Guo-Jun	School of Information, Central University of Finance and Economics, Beijing 100081, China

引用文本：
张洁,毛国君.基于序列前缀技术的XML频繁路径挖掘算法.计算机系统应用,2018,27(1):78-85
ZHANG Jie,MAO Guo-Jun.Prefix-Based XML Frequent Path Mining Algorithm.COMPUTER SYSTEMS APPLICATIONS,2018,27(1):78-85