Graph-Based Substructure Pattern Mining

  • Published 2016-05-25 from Anhui

Abstract

We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.

1. Introduction

Frequent substructure pattern mining has been an emerging data mining problem with many scientific and commercial applications. As a general data structure, a labeled graph can be used to model complicated substructure patterns among data. Given a graph dataset D = {G0, G1, ..., Gn}, support(g) denotes the number of graphs in D in which g is a subgraph. The problem of frequent subgraph mining is to find every subgraph g such that support(g) ≥ minSup (a minimum support threshold). To reduce the complexity of the problem (while also taking into account the connectivity of hidden structures in most situations), only frequent connected subgraphs are studied in this paper.

The kernel of frequent subgraph mining is the subgraph isomorphism test. Many well-known pairwise isomorphism testing algorithms have been developed, but the frequent subgraph mining problem itself has not been explored well. Recently, Inokuchi et al. [4] proposed an Apriori-based algorithm, called AGM, to discover all frequent (both connected and disconnected) substructures. Kuramochi and Karypis [5] further developed the idea using an adjacency representation of graphs and an edge-growing strategy. Their algorithm, called FSG, is able to find all frequent connected subgraphs from a chemical compound dataset in 10 minutes with 6.5% minimum support.
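To make the definition of support(g) concrete, the following is a minimal Python sketch that counts how many graphs in a dataset D contain a labeled pattern g. It is not gSpan and is not taken from the paper: it uses a naive backtracking subgraph isomorphism test (label-preserving, non-induced), and the names Graph, is_subgraph, support, and the toy dataset are hypothetical, chosen only for illustration.

```python
class Graph:
    """Undirected labeled graph (hypothetical helper, not from the paper)."""

    def __init__(self, vertex_labels, edges):
        # vertex_labels[i] is the label of vertex i
        self.vlabels = list(vertex_labels)
        # edges maps an unordered vertex pair (u, v) to an edge label
        self.elabels = {frozenset(pair): lab for pair, lab in edges.items()}


def is_subgraph(g, G):
    """Return True if pattern g maps injectively into G, preserving labels."""
    n = len(g.vlabels)

    def extend(mapping):
        # mapping[j] is the vertex of G assigned to pattern vertex j
        k = len(mapping)
        if k == n:
            return True
        for cand in range(len(G.vlabels)):
            if cand in mapping or G.vlabels[cand] != g.vlabels[k]:
                continue
            # Each pattern edge from k to an already mapped vertex must
            # exist in G with the same edge label.
            consistent = True
            for j in range(k):
                pair = frozenset((j, k))
                if pair in g.elabels:
                    image = frozenset((mapping[j], cand))
                    if G.elabels.get(image) != g.elabels[pair]:
                        consistent = False
                        break
            if consistent and extend(mapping + [cand]):
                return True
        return False

    return extend([])


def support(g, D):
    """Number of graphs in D that contain g as a subgraph."""
    return sum(1 for G in D if is_subgraph(g, G))


# Toy dataset: edge label 1 = single bond, 2 = double bond (assumed encoding).
D = [
    Graph(["C", "C", "O"], {(0, 1): 1, (1, 2): 1}),             # C-C-O
    Graph(["C", "O"], {(0, 1): 2}),                             # C=O
    Graph(["C", "C", "C"], {(0, 1): 1, (1, 2): 1, (0, 2): 1}),  # C3 ring
]
cc = Graph(["C", "C"], {(0, 1): 1})  # pattern C-C
co = Graph(["C", "O"], {(0, 1): 1})  # pattern C-O

min_sup = 2
frequent = [name for name, g in (("C-C", cc), ("C-O", co))
            if support(g, D) >= min_sup]
```

With minSup = 2, only the C-C pattern qualifies here, since it occurs in two of the three graphs while C-O with a single bond occurs in one. The test is exponential in the worst case, which is one reason the cost of repeated subgraph isomorphism checks dominates frequent subgraph mining.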
