A Python Wrapper for Mining Frequent Graphs Using ParSeMiS

As part of my PhD, I’ve been doing a lot of frequent graph mining, so I thought I’d share some of the scripts I’ve written to mine frequent graphs with an existing library called ParSeMiS.

ParSeMiS is a Java library that implements the gSpan algorithm (as well as a few others) for detecting frequent subgraphs in graph datasets.

The library takes two main inputs:

  • An input file that contains a list of graphs
  • A minimum support s, where s is either an integer > 0 or a percentage; a subgraph is considered frequent if it occurs in at least s graphs
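To make the input file concrete, ParSeMiS reads graphs in a simple line-based text format shared by many gSpan implementations (shown here as my understanding of it — check it against the format your ParSeMiS build expects): each graph starts with a `t` line, followed by `v <id> <label>` lines for nodes and `e <src> <dst> <label>` lines for edges.

```
t # 0
v 0 tree
v 1 rock
e 0 1 near
```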

As it’s a Java library, I’ve written a small wrapper in Python to interface with it.
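Since ParSeMiS ships as a runnable jar, the core of such a wrapper is essentially assembling a Java command line and shelling out to it. A minimal sketch — the flag names follow ParSeMiS’s command-line options as I understand them, but verify them against the jar you’re using:

```python
def build_parsemis_cmd(jar, infile, outfile, min_freq, close_graph=False):
    """Assemble the java invocation for ParSeMiS.

    Flag names (--graphFile, --outputFile, --minimumFrequency,
    --closeGraph) are assumptions based on ParSeMiS's documented
    options; check them against your build of the jar.
    """
    cmd = [
        "java", "-jar", jar,
        f"--graphFile={infile}",
        f"--outputFile={outfile}",
        f"--minimumFrequency={min_freq}",
    ]
    if close_graph:
        # CloseGraph mines closed frequent subgraphs and uses less memory.
        cmd.append("--closeGraph=true")
    return cmd

# The wrapper would then run the command, e.g.:
# subprocess.run(build_parsemis_cmd("parsemis.jar", "in.lg", "out.lg", "5%"))
```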

The wrapper uses NetworkX to manage graphs, and takes as input a list of NetworkX graphs. Currently it only deals with directed graphs, but when I get a chance, I’ll add in undirected graph support as well.

If you haven’t used NetworkX before, the choice is mainly down to its read/write functionality. This lets you create your graphs in a variety of formats and have NetworkX load them. As the wrapper returns a list of NetworkX graphs, you can then write the results to disk (or handle them however you like).

At some point in the future, I may update it to use graph-tools instead, as graph-tools has much better support for things like detecting subgraph isomorphism, and handling larger graphs in general.

The library can be found at: https://github.com/tomkdickinson/parsemis_wrapper

I’ve included a small example that shows how the wrapper works.

3 thoughts on “A Python Wrapper for Mining Frequent Graphs Using ParSeMiS”

    • Hi,

      Typically, I’ve found it’s related to the number of nodes in the graphs you’re putting in, and the type of operation you’re trying to do.

      If you’re struggling with memory issues, I’d suggest first trying CloseGraph (by itself it uses less memory), and also collapsing your labels together (i.e. if the node “tree” appears 3 times, keep just the one node, not three).

      Typically, with those kinds of improvements, I can run the mining process on a thousand or so graphs (each with 100-1000 nodes) using about 4 GB of memory.
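The label-collapsing trick described above can be done directly on the NetworkX graphs before mining. A sketch — the helper name and the `label` attribute are my own choices, not part of the wrapper:

```python
import networkx as nx

def collapse_labels(g, label_attr="label"):
    """Merge every set of nodes sharing a label into a single node,
    so e.g. three 'tree' nodes become one. Parallel edges between the
    same pair of labels collapse too, shrinking the graphs (and the
    miner's memory use) before they are handed to ParSeMiS."""
    h = nx.DiGraph()
    for _, data in g.nodes(data=True):
        h.add_node(data[label_attr], **{label_attr: data[label_attr]})
    for u, v, data in g.edges(data=True):
        h.add_edge(g.nodes[u][label_attr], g.nodes[v][label_attr], **data)
    return h
```

Note that this is lossy: if the distinction between duplicate-label nodes matters for your analysis, the collapsed graphs will no longer capture it.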

