Extracting Instagram Data – Part 1

For the next two posts, I’m going to introduce some of the techniques I’ve been using to mine content on Instagram, without using their API. This will include extracting a set of posts for a particular hashtag, extracting user information, and extracting a public user’s timeline. This first post will introduce some of the concepts behind the REST service that Instagram’s front-end JS client uses to load its data, and will start by showing how we can use it to extract posts containing a hashtag.

If you’re just interested in the python code for this example, it can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.
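
To give a flavour of the approach, here’s a minimal sketch that fetches the JSON Instagram’s front end loads for a hashtag page. The “?__a=1” endpoint is an assumption about how Instagram’s web client behaved at the time of writing, and isn’t necessarily what the linked repo uses, so treat it as illustrative only.

import json
import requests

def fetch_hashtag_json(tag):
    # Sketch only: appending "?__a=1" made Instagram's hashtag pages return the
    # same JSON their front-end JS client consumes. This is an assumption about
    # the web client at the time of writing and may no longer work.
    url = "https://www.instagram.com/explore/tags/%s/?__a=1" % tag
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = fetch_hashtag_json("python")
    # Print the top-level keys so you can explore the structure yourself.
    print(json.dumps(list(data.keys()), indent=2))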

ConceptNet to Neo4J

ConceptNet is a pretty useful semantic network, and can either be accessed online, or run locally with Docker (https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy).

However, as someone who uses Neo4J quite a bit, I thought it’d be useful to be able to run Cypher queries over some of those relationships.

To that end, I’ve written a quick script that takes a CSV dump of ConceptNet as input, and converts it into a CSV format for Neo4J. You can then use ./neo4j-import to import that data into a ConceptNet database (this takes less than a minute, which is pretty handy!).

Currently, I’ve only included the following triples in the import script:

start_uri, relationship, end_uri

However, I’ll update the script to add in the other interesting hypergraph properties, enriching the edges of the Neo4J graph.
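
For a rough idea of what that conversion looks like, here’s a minimal sketch. It assumes the tab-separated layout of the ConceptNet dump (edge URI, relation, start, end, JSON metadata) and invents the output file names, so the linked repo below remains the authoritative version.

import csv

def convert(conceptnet_dump, nodes_out, rels_out):
    # Assumes a tab-separated dump with columns:
    #   edge_uri, relationship, start_uri, end_uri, json_metadata
    nodes = set()
    with open(conceptnet_dump, newline='', encoding='utf-8') as dump, \
         open(rels_out, 'w', newline='', encoding='utf-8') as rels_file:
        reader = csv.reader(dump, delimiter='\t')
        rels = csv.writer(rels_file)
        rels.writerow([':START_ID', ':END_ID', ':TYPE'])
        for row in reader:
            _, relationship, start_uri, end_uri = row[:4]
            nodes.add(start_uri)
            nodes.add(end_uri)
            # Turn "/r/IsA" into "IsA" so it's usable as a relationship type.
            rels.writerow([start_uri, end_uri, relationship.split('/')[-1]])
    with open(nodes_out, 'w', newline='', encoding='utf-8') as nodes_file:
        writer = csv.writer(nodes_file)
        writer.writerow(['uri:ID'])
        for uri in sorted(nodes):
            writer.writerow([uri])

if __name__ == "__main__":
    convert('conceptnet_dump.csv', 'nodes.csv', 'relationships.csv')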

The script can be found here: https://github.com/tomkdickinson/conceptnet_neo4j

Details for using the ./neo4j-import tool can also be found here: http://neo4j.com/docs/operations-manual/current/tutorial/import-tool/

Extracting a Larger Twitter Dataset

All credit for this method should go to https://github.com/simonlindgren who shared a script of his that motivated me to implement it into this work.

One of the issues with the current implementation of extracting tweets from Twitter’s search API is that, at any point during the extraction process, Twitter can stop returning results and the chain of calls is broken. This behaviour is unpredictable, and most likely down to an automated feature in Twitter’s backend.

An alternative is to slice your queries up into time intervals: rather than performing one large search, you perform several much smaller searches, each of which includes since and until date parameters so that it only extracts tweets for a given day.

This has two major advantages. Firstly, it helps to mitigate the issue where your search chain might be broken prematurely, and secondly, it allows you to collect data in parallel, vastly reducing the collection time.
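
To make the slicing idea concrete, here’s a minimal, self-contained sketch (not the TwitterSlicer implementation itself) that splits a date range into day-long search queries and runs them on a thread pool; run_search is a placeholder for whatever actually performs the search.

from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def day_queries(query, since, until):
    # Yield one query per day, e.g. "Babylon 5 since:2016-10-01 until:2016-10-02".
    day = since
    while day < until:
        next_day = day + timedelta(days=1)
        yield "%s since:%s until:%s" % (query, day.isoformat(), next_day.isoformat())
        day = next_day

def run_search(query):
    # Placeholder: in practice this would call the scraper for a single day's query.
    print("Searching:", query)
    return query

if __name__ == "__main__":
    queries = list(day_queries("Babylon 5", date(2016, 10, 1), date(2016, 12, 1)))
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(run_search, queries))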

To test both these hypotheses, I’ve written a new class called TwitterSlicer, which takes in a “since” and an “until” date, as well as a parameter specifying how many concurrent threads you want running.

For our first hypothesis, on the number of tweets collected, I’ve used the original TwitterSearch method to search for the query “Babylon 5”. To compare, I used the TwitterSlicer class, looking for tweets just between 2016-10-01 and 2016-12-01. Technically the advantage should be with the original approach using TwitterSearch, as it has no time limit. Table 1 shows the results.

Table 1
Method          Query       Total Tweets
TwitterSearch   Babylon 5   127
TwitterSlicer   Babylon 5   3720

As can be seen, the original query only extracts 127 tweets, whereas the TwitterSlicer method extracts 3720.

To test our second hypothesis, I modified the TwitterSearch query to search between the same date parameters as the TwitterSlicer method. Table 2 shows the results:

Table 2
Method          Query                                         Threads   Total Tweets   Time Taken
TwitterSearch   Babylon 5 since:2016-10-01 until:2016-12-01   1         3720           138 seconds
TwitterSlicer   Babylon 5                                     5         3720           45 seconds

Interestingly, adding the date parameters has now fixed the issue with TwitterSearch, which has collected the same number of tweets as the TwitterSlicer method. However, the big difference is that the TwitterSlicer approach was about 3 times faster, highlighting the second advantage.

Given the nature of the collection process, there is also no reason why the approach couldn’t be implemented using something like Hadoop and MapReduce, to scale collection up even further.

For those interested in using the modified script, I’ve added a new class to TwitterScraper.py. As mentioned before, the new class is called TwitterSlicer, and it takes in a “since” and an “until” parameter, both of which should be datetime.datetime objects, as well as a parameter indicating how many concurrent threads you want collecting data. As it uses a ThreadPoolExecutor for parallel tasks, I’ve switched the master branch to use Python 3. However, I’ve also created a python2 branch, which contains the new code, but does not have the parallel option at the moment.
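
A hypothetical invocation might look something like the following; the constructor arguments are assumptions pieced together from the description above, so check TwitterScraper.py for the actual signature.

from datetime import datetime

# Hypothetical usage only: the names below are assumptions, not the repo's exact API.
since = datetime(2016, 10, 1)
until = datetime(2016, 12, 1)
# slicer = TwitterSlicer(since=since, until=until, n_threads=5)
# slicer.search("Babylon 5")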

Python 3, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/master/TwitterScraper.py
Python 2, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/python2/TwitterScraper.py

A Python Wrapper for Mining Frequent Graphs Using ParSeMiS

As part of my PhD, I’ve been doing a lot of frequent graph mining, and thought I’d share some of the scripts I’ve written to mine frequent graphs using an already existing library called ParSeMiS.

ParSeMiS is a Java library that implements the gSpan algorithm (as well as a few others) for detecting frequent subgraphs in graph datasets.

The library takes two main inputs:

  • An input file containing a list of graphs
  • A minimum support s, either an integer > 0 or a percentage, where a subgraph is considered frequent if it occurs in more than s graphs

As it’s a Java library, I’ve written a small wrapper in Python to interface with it.

The wrapper uses NetworkX to manage graphs, and takes as input a list of NetworkX graphs. Currently it only deals with directed graphs, but when I get a chance, I’ll add in undirected graph support as well.

If you haven’t used NetworkX before, the choice is mainly to do with its read/write functionality. This allows you to create your graphs in a variety of different formats, and then have NetworkX load them. As the wrapper will return a list of NetworkX graphs, you can then write them to disk (or handle them yourself).

At some point in the future, I may update it to use graph-tools instead, as graph-tools has much better support for things like detecting subgraph isomorphism, and handling larger graphs in general.

The library can be found at: https://github.com/tomkdickinson/parsemis_wrapper

I’ve included a small example that shows how the wrapper works.
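
To illustrate the shape of that input, here’s a rough sketch that builds a small list of directed NetworkX graphs, the kind of thing you’d hand to the wrapper; the commented-out ParsemisMiner call is a placeholder, so check the repo’s example for the actual interface.

import networkx as nx

def build_example_graphs():
    # A few small directed graphs sharing a common "created"/"contains" pattern.
    graphs = []
    for _ in range(3):
        g = nx.DiGraph()
        g.add_edge("user", "post", label="created")
        g.add_edge("post", "hashtag", label="contains")
        graphs.append(g)
    # One graph without the "contains" edge, so not every pattern is frequent.
    g = nx.DiGraph()
    g.add_edge("user", "post", label="created")
    graphs.append(g)
    return graphs

if __name__ == "__main__":
    graphs = build_example_graphs()
    # Hypothetical call, names are placeholders; see the repo for the real API:
    # frequent_graphs = ParsemisMiner("output").mine_graphs(graphs, minimum_frequency=2)
    for g in graphs:
        print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")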

Twitter Search Example in Python

This post is in response to a few comments. Firstly, as some people had requested it, I’ve re-written the example from my previous post in Python. I’ll be the first to admit my Python skills aren’t fantastic, but I tested it against collecting 5k tweets and it appeared to work.

The library can be found at https://github.com/tomkdickinson/Twitter-Search-API-Python. Similar to the Java one, it defines an abstract class you can extend, writing your own custom implementation of what to do with the tweets on each call to Twitter.
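
To show the extension pattern in a self-contained way (this is a simplified sketch, not the repo’s actual class or method names), the idea is an abstract base class that drives the search loop, with a hook you override to decide what happens to each batch of tweets:

from abc import ABC, abstractmethod

class AbstractTweetSearch(ABC):
    # Simplified sketch of the pattern; the real class also builds URLs and
    # calls Twitter's search endpoint on each iteration.

    def search(self, query):
        # Here we just simulate two batches of results for the given query.
        for batch in (["tweet 1", "tweet 2"], ["tweet 3"]):
            if not self.save_tweets(batch):
                break

    @abstractmethod
    def save_tweets(self, tweets):
        """Handle a batch of tweets; return False to stop the search early."""

class PrintingTweetSearch(AbstractTweetSearch):
    def save_tweets(self, tweets):
        for tweet in tweets:
            print(tweet)
        return True

if __name__ == "__main__":
    PrintingTweetSearch().search("Babylon 5")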


An Intro to Gradle

As I use Gradle in a lot of my projects these days, I thought I would write a few posts about my experiences using it. This first introductory post won’t go into much detail, so if you already know how to use Gradle, even if it’s just the very basics, I would skip this post for now.

So why Gradle?

Well for me, there are two main reasons:

  1. It’s more adaptable than other build tools like Maven. It’s commonly referred to as the sweet spot between Maven and Ant, as it has the useful dependency and life-cycle features of Maven, as well as the task-based options you get with Ant.
  2. Gradle build scripts are a lot easier to read than both their Ant and Maven equivalents (at least in my opinion).

To get started with Gradle, you’ll need a copy of Gradle installed, which can be downloaded from the Gradle website. The latest version will do for now (at the time of writing, the latest version was 2.6). Unzip it somewhere on your machine and add it to your path.

Add Gradle to Path on Windows

  1. Press “Windows Key + R”
  2. Type “control /name microsoft.system” and hit enter
  3. Go to “Advanced System Settings”
  4. Go to “Environment Variables”
  5. Under “System Variables”, select “Path” and hit edit
  6. Append this to the end of the variable value: “;\path\to\gradle\bin”
  7. Exit everything you opened
  8. Open a command prompt and type “gradle -version” to check it’s working
Add Gradle to Path on Linux

  1. Open up a text editor.
  2. Open the file ~/.bashrc
  3. At the bottom, add: export PATH=$PATH:/path/to/gradle/bin
  4. Save and close
  5. Open up a terminal, and type “gradle -version” to check it’s working

With Gradle installed, all you now need is a build.gradle file, which defines your Gradle build.

Here’s an example that IntelliJ IDEA generates for you when you want a new Gradle Java project:

group 'uk.co.tomkdickinson.scraper'
version '1.0-SNAPSHOT'
 
apply plugin: 'java'
 
sourceCompatibility = 1.7
 
repositories {
    mavenCentral()
}
 
dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Let’s take a look at the different components that make up the build script:

group 'uk.co.tomkdickinson.scraper'
version '1.0-SNAPSHOT'

These are standard Gradle properties that indicate the group and version number of your application. You can find more details on these and other properties in section 13.2.1 of the Gradle documentation.

apply plugin: 'java'

Plugins in Gradle specify the default build logic for your project. Here we apply the Java plugin, which defines a basic Java project. This tells Gradle that the following locations contain your Java code:

Java classes

src/main/java/

Java Unit Tests

src/test/java/

When specifying plugins, there are additional properties you can add to your build.gradle file.

For example:

sourceCompatibility = 1.7

specifies the source compatibility that Gradle uses when compiling your Java classes; in this case, Java 1.7. More properties like this can be found in the Gradle documentation for the Java plugin.

The last part of this build script is all about dependency management. If you are already familiar with dependency management, Gradle is compatible out of the box with both Maven and Ivy. For those who are not, dependency management is a really useful tool for automatically including other libraries in your project. In the context of Maven, you specify the groupId, artifactId, and version of a particular library you want to include in your project. When your project builds, it will then automatically fetch this library and make it available in your project.

Let’s take a look at our example:

repositories {
    mavenCentral()
}
 
dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Here we notice two sections: repositories, and dependencies.

The repositories section allows us to specify sites that host libraries (often served from a repository manager such as Nexus). mavenCentral() is a built-in method in Gradle that points at the Maven Central repository. If you want to include Maven artifacts that you may have installed locally, you can add mavenLocal(), or if you want to add a different repository altogether, you can specify:

maven {
    url "http://yourrepository"
}

The dependencies section is where you declare your libraries. In our example, IDEA has automatically included JUnit. As it’s a library we only want when running our unit tests, we specify “testCompile”. This library then won’t be bundled with our final jar when built.

For any other dependency you want to include in your application, you specify:

    compile group: 'groupName', name: 'artifactName', version: 'versionNumber'

To test that the build file works, you can just run “gradle build” from the command line, in the same directory as your build.gradle file.

That’s the very basics for using Gradle. I’ll follow up with some more useful functionality in the near future, rather than just constructing a skeleton build file.

Scraping Tweets Directly from Twitter’s Search – Update

Sorry for my delayed response to this; I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address it!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script. If you are just interested in grabbing the code, I’ve pushed the changes to https://github.com/tomkdickinson/TwitterSearchAPI, but feel free to read on if you want to know what’s been changed and why.

The first change is tiny. Originally, to get all tweets rather than “top tweet”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we get a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

That parameter, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike scroll_cursor, which existed in the response to be extracted, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (the 2nd number set), we only need to extract it once, as we can keep it the same for all calls, just as Twitter does.

For the oldest tweet (the 1st number set), we extract the last tweet ID in our results each time, and use it to update our max_position value.

So, let’s take a look at some of the code I’ve changed:

String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
    // On the first call, remember the newest tweet ID; it stays fixed for all later calls.
    if(minTweet==null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    // The oldest tweet in this batch becomes the lower bound for the next request.
    String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
    if(!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Build the new paging token in the format TWEET-{maxTweetId}-{minTweetId}.
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if(query==null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder;
        uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch(MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have a value to begin with. On our first call, though, we take the first tweet in our response and, if minTweet is still null, set minTweet to its ID.

Next, we need to get maxTweet. As mentioned before, we get this by taking the last tweet in our results and using its ID. So that we don’t repeat results, we make sure that minTweet does not equal the maxTweet ID, and if not, we construct our “max_position” value in the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

Also, in constructURL, the TYPE_PARAM value has been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;

Scraping Tweets Directly from Twitter’s Search Page – Part 2

UPDATE: The code outlined here won’t work directly with Twitter now, as they have updated their source code. Thankfully, it only takes a few small changes to modify this script for it to work, which I’ve outlined here.

In the previous post, we effectively covered the theory of how we can search for and extract tweets from Twitter without having to use their API. This post deals with an example implementation in Java, with an example git repository that you can improve on or use yourself. You can find the git repository at https://github.com/tomkdickinson/TwitterSearchAPI.
