Extracting Instagram Data – Part 1

For the next two posts, I’m going to introduce some of the techniques I’ve been using to mine content on Instagram, without using their API. This will include extracting a set of posts for a particular hashtag, extracting user information, and extracting a public user’s timeline. This first post will introduce some of the concepts behind the REST service that Instagram’s front-end JS client uses to load data, and will show how we can use it to extract posts containing a given hashtag.

If you’re just interested in the Python code for this example, it can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.
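As a taste of the approach, here is a minimal sketch of hitting the JSON endpoint the front-end client used for hashtag pages at the time of writing. The URL pattern is an assumption based on observed behaviour, and Instagram changes it often, so treat this purely as an illustration:

import requests

# Illustrative only: the "?__a=1" JSON endpoint pattern is an assumption
# based on how the front-end client loaded hashtag pages at the time,
# and Instagram changes this regularly.
def fetch_hashtag_page(tag):
    url = "https://www.instagram.com/explore/tags/%s/?__a=1" % tag
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return response.json()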

ConceptNet to Neo4J

ConceptNet is a pretty useful semantic network, and can either be accessed online, or run locally with Docker (https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy).

However, as someone who uses Neo4J quite a bit, it’d be useful to be able to run Cypher queries over some of those relationships.

To that end, I’ve written a quick script that takes as input a CSV dump of ConceptNet, and converts it into a CSV format that Neo4J can import. You can then use ./neo4j-import to load that data into a ConceptNet database (it takes less than a minute, which is pretty handy!).

Currently, I’ve only included the following triples in the import script:

start_uri, relationship, end_uri

However, I’ll update the script to add in the other interesting hypergraph properties, enriching the edges of the Neo4J graph.

The script can be found here: https://github.com/tomkdickinson/conceptnet_neo4j

Details for using the ./neo4j-import tool can also be found here: http://neo4j.com/docs/operations-manual/current/tutorial/import-tool/
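For a rough idea of what the conversion involves, here is a minimal sketch. It assumes the tab-separated layout of the ConceptNet 5 dump (edge URI, relation, start URI, end URI, JSON metadata) and uses placeholder file names; the repo script is the reference implementation:

import csv

# A minimal sketch, assuming the tab-separated ConceptNet 5 dump layout
# (edge URI, relation, start URI, end URI, JSON metadata). File names
# are placeholders.
with open("conceptnet_dump.csv") as dump, \
     open("nodes.csv", "w", newline="") as nodes_out, \
     open("relationships.csv", "w", newline="") as rels_out:
    rels = csv.writer(rels_out)
    rels.writerow([":START_ID", ":END_ID", ":TYPE"])
    nodes = set()
    for line in dump:
        _, relation, start_uri, end_uri = line.rstrip("\n").split("\t")[:4]
        nodes.update([start_uri, end_uri])
        # Strip the "/r/" prefix so the relationship type reads e.g. RelatedTo
        rels.writerow([start_uri, end_uri, relation.split("/")[-1]])
    node_writer = csv.writer(nodes_out)
    node_writer.writerow(["uri:ID"])
    for uri in sorted(nodes):
        node_writer.writerow([uri])

The output can then be loaded with something like ./neo4j-import --into data/databases/conceptnet.db --nodes nodes.csv --relationships relationships.csv.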

Extracting a Larger Twitter Dataset

All credit for this method should go to https://github.com/simonlindgren, who shared a script of his that motivated me to incorporate it into this work.

One of the issues with the current implementation of extracting Tweets from Twitter’s search API is that at any point during the extraction process, Twitter can stop returning results, and the chain of calls is broken. This behaviour is unpredictable, and most likely down to an automated feature in Twitter’s backend.

An alternative is to slice your queries up into time intervals: rather than performing one large search, you perform several much smaller searches, where each search includes since and until date parameters so that it only extracts tweets for a given day (sketched below).
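To make the idea concrete, here is a minimal sketch of the slicing step, using Twitter’s since:/until: search operators (the helper name is mine, for illustration):

from datetime import date, timedelta

def build_daily_queries(query, since, until):
    # One search query per day, bounded by Twitter's since:/until:
    # search operators, so no single search chain grows too long.
    queries = []
    day = since
    while day < until:
        next_day = day + timedelta(days=1)
        queries.append("%s since:%s until:%s" % (query, day, next_day))
        day = next_day
    return queries

# e.g. build_daily_queries("Babylon 5", date(2016, 10, 1), date(2016, 12, 1))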

This has two major advantages. Firstly, it helps to mitigate the issue where your search chain might be broken prematurely, and secondly, it allows you to parallelise data collection, vastly reducing the collection time.

To test both these hypotheses, I’ve written a new class called TwitterSlicer, which takes in a “since” and an “until” date, as well as a #threads parameter to specify how many concurrent threads you want running.

For our first hypothesis, on the number of tweets collected, I’ve used the original TwitterSearch method to search for the query “Babylon 5”. To compare, I used the TwitterSlicer class, looking for tweets just between 2016-10-01 and 2016-12-01. Technically the advantage should be with the original TwitterSearch approach, as it has no time limit. Table 1 shows the results.

Table 1
Method          Query       Total Tweets
TwitterSearch   Babylon 5   127
TwitterSlicer   Babylon 5   3720

As can be seen, the original query only extracts 127 tweets, whereas the TwitterSlicer method extracts 3720.

To test our second hypothesis, I modified the query for TwitterSearch to search between the same date parameters as the TwitterSlicer method. Table 2 shows the results:

Table 2
Method          Query                                         Threads   Total Tweets   Time Taken
TwitterSearch   Babylon 5 since:2016-10-01 until:2016-12-01   1         3720           138 seconds
TwitterSlicer   Babylon 5                                     5         3720           45 seconds

Interestingly, adding the date parameters has now fixed the issue with TwitterSearch, which has collected the same number of tweets as the TwitterSlicer method. However, the big difference here is that the TwitterSlicer approach was about 3 times faster, highlighting the second advantage.

Given the nature of the collection process, there is also no reason why the approach couldn’t be implemented using something like Hadoop and MapReduce, to scale collection up even further.

For those interested in using the modified script, I’ve added a new class to TwitterScraper.py. As mentioned before, the new class is called TwitterSlicer, and takes in a “since” and an “until” parameter, both of which should be datetime.datetime instances, as well as a #threads parameter to indicate how many concurrent threads you want collecting data. As it uses a ThreadPoolExecutor for parallel tasks, I’ve switched the master branch to use Python 3. However, I’ve also created a python2 branch, which contains the new code, but does not have the parallel option at the moment.
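A minimal usage sketch is below. The constructor arguments mirror the description above, but the exact parameter names (and any rate-limit arguments) may differ, so check the linked script for the real signature:

from datetime import datetime
from TwitterScraper import TwitterSlicer  # assumes the script is importable

# Hypothetical constructor call: argument names follow the description
# above, not necessarily the exact signature in the repo.
slicer = TwitterSlicer(since=datetime(2016, 10, 1),
                       until=datetime(2016, 12, 1),
                       n_threads=5)
slicer.search("Babylon 5")  # hypothetical entry point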

Python 3, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/master/TwitterScraper.py
Python 2, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/python2/TwitterScraper.py

A Python Wrapper for Mining Frequent Graphs Using ParSeMiS

As part of my PhD, I’ve been doing a lot of frequent graph mining, and thought I’d share some of the scripts I’ve written to mine frequent graphs using an existing library called ParSeMiS.

ParSeMiS is a Java library that implements the gSpan algorithm (as well as a few others) for detecting frequent subgraphs in graph datasets.

The library takes two main inputs:

  • An input file that contains a list of graphs
  • A minimum support s, either an integer > 0 or a percentage, where a “subgraph” is considered frequent if it occurs in at least s graphs

As it’s a Java library, I’ve written a small wrapper in Python to interface with it.

The wrapper uses NetworkX to manage graphs, and takes as input a list of NetworkX graphs. Currently it only deals with directed graphs, but when I get a chance, I’ll add in undirected graph support as well.

If you haven’t used NetworkX before, the choice is mainly down to its read/write functionality. This allows you to create your graphs in a variety of different formats, and then have NetworkX load them. As the wrapper returns a list of NetworkX graphs, you can then write them to disk (or handle them yourself).
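For example, loading input graphs from GraphML and writing results back out takes only a couple of calls (file names here are placeholders):

import networkx as nx

# Load a directed graph from one of the formats NetworkX supports
# (GraphML here; GML, edge lists, etc. work similarly).
g = nx.read_graphml("input_graph.graphml")  # placeholder file name

# ... mine frequent subgraphs, then write each result back to disk ...
nx.write_graphml(g, "frequent_subgraph_0.graphml")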

At some point in the future, I may update it to use graph-tool instead, as graph-tool has much better support for things like detecting subgraph isomorphism, and handling larger graphs in general.

The library can be found at: https://github.com/tomkdickinson/parsemis_wrapper

I’ve included a small example that shows how the wrapper works.
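Along the lines of that example, usage looks roughly like the following. The class and method names here are assumptions for illustration; the example in the repo is the authoritative reference:

import networkx as nx
# Hypothetical import, class, and method names, for illustration only;
# see the example bundled with the repo for the actual API.
from parsemis_wrapper import ParsemisMiner

# Build a couple of small directed graphs that share a common edge
g1 = nx.DiGraph()
g1.add_edge("a", "b", label="follows")
g1.add_edge("b", "c", label="mentions")

g2 = nx.DiGraph()
g2.add_edge("a", "b", label="follows")

# Mine subgraphs appearing in at least 2 of the input graphs
frequent = ParsemisMiner(minimum_frequency=2).mine_graphs([g1, g2])
for g in frequent:
    print(g.edges(data=True))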

Twitter Search Example in Python

This post is in response to a few comments. Firstly, as some people had requested it, I’ve re-written the example from here in Python. I’ll be the first to admit my Python skills aren’t fantastic, but I tested it against collecting 5k tweets and it appeared to work.

The library can be found at https://github.com/tomkdickinson/Twitter-Search-API-Python. Similar to the Java one, it defines an abstract class you can extend to write your own custom implementation of what to do with the tweets returned by each call to Twitter.
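The extension pattern looks roughly like this. The base-class and hook names below are placeholders; the actual names live in the script in the repo:

# Placeholder names throughout; the real abstract class and its hook
# are defined in the linked repo's TwitterScraper.py.
from TwitterScraper import TwitterSearch

class PrintingSearch(TwitterSearch):
    def save_tweets(self, tweets):
        # Called with each batch of tweets returned from Twitter;
        # persist them, print them, or stream them elsewhere.
        for tweet in tweets:
            print(tweet)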
