Extracting Instagram Data – Part 1

For the next two posts, I’m going to introduce some of the techniques I’ve been using to mine content on Instagram without using their API. This will include extracting a set of posts for a particular hashtag, extracting user information, and extracting a public user’s timeline. This first post introduces some of the concepts behind the REST service that Instagram’s front-end JS client uses to load data, and starts by showing how we can use it to extract posts containing a hashtag.

If you’re just interested in the python code for this example, it can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.
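
As a taster of the approach, here is a minimal sketch. It assumes the JSON endpoint that Instagram’s web client exposed at the time of writing (appending ?__a=1 to a tag page URL); both the endpoint and the shape of the returned JSON are assumptions that may well have changed since, so treat it as illustrative only.

import requests

def fetch_hashtag_page(tag):
    # Assumed endpoint: the tag page with "?__a=1" returned the same JSON
    # that the front-end JS client loads. Illustrative; may no longer work.
    url = "https://www.instagram.com/explore/tags/%s/?__a=1" % tag
    response = requests.get(url)
    response.raise_for_status()
    return response.json()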

Extracting a Larger Twitter Dataset

All credit for this method should go to https://github.com/simonlindgren, who shared a script of his that motivated me to incorporate the idea into this work.

One of the issues with the current implementation of extracting Tweets from Twitter’s search is that, at any point during the extraction process, Twitter can stop returning results and the chain of calls is broken. This behaviour is unpredictable, and most likely down to some automated feature in Twitter’s backend.

An alternative is to slice your queries up into time intervals: rather than performing one large search, you perform several much smaller searches, each with since and until date parameters in the search criteria, so that each search only extracts tweets for a given day.

This has two major advantages. Firstly, it helps mitigate the issue where your search chain might be broken prematurely, and secondly, it allows you to parallelise data collection, vastly reducing the collection time.

To test both these hypotheses, I’ve written a new class called TwitterSlicer, which takes in a “since” and an “until” date, as well as a #threads parameter to specify how many concurrent threads you want running.
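
To make the idea concrete, here is a minimal sketch of the slicing approach. This is not the code from the repository; collect_tweets is a hypothetical stand-in for the actual per-query scraper, and the one-day slice size simply matches the description above.

import datetime
from concurrent.futures import ThreadPoolExecutor

def day_slices(since, until):
    # Yield (start, end) pairs covering [since, until) one day at a time.
    day = since
    while day < until:
        yield day, day + datetime.timedelta(days=1)
        day += datetime.timedelta(days=1)

def sliced_search(query, since, until, n_threads=5):
    def search_one_day(bounds):
        start, end = bounds
        # Each slice is an ordinary search, restricted by since:/until: operators.
        day_query = "%s since:%s until:%s" % (
            query, start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))
        return collect_tweets(day_query)  # hypothetical: your per-query scraper
    # Slices are independent, so they can safely run concurrently.
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        results = list(executor.map(search_one_day, day_slices(since, until)))
    return [tweet for day in results for tweet in day]

If one slice’s chain of calls breaks, only that day’s results are affected; the other slices carry on regardless.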

For our first hypothesis, on the number of tweets collected, I’ve used the original TwitterSearch class to search for the query “Babylon 5”. To compare, I used the TwitterSlicer class, looking for tweets just between 2016-10-01 and 2016-12-01. Technically the advantage should be with the original TwitterSearch approach, as it has no time limit. Table 1 shows the results.

Table 1
Method          Query       Total Tweets
TwitterSearch   Babylon 5   127
TwitterSlicer   Babylon 5   3720

As can be seen, the original query only extracts 127 tweets, whereas the TwitterSlicer method extracts 3720.

To test our second hypothesis, I modified the query for TwitterSearch to search between the same date parameters as the TwitterSlicer method. Table 2 shows the results:

Table 2
Method          Query                                         Threads   Total Tweets   Time Taken
TwitterSearch   Babylon 5 since:2016-10-01 until:2016-12-01   1         3720           138 seconds
TwitterSlicer   Babylon 5                                     5         3720           45 seconds

Interestingly, adding the date parameters has now fixed the issue with TwitterSearch, which collected the same number of tweets as the TwitterSlicer method. However, the big difference is that the TwitterSlicer approach was about three times faster, highlighting the second advantage.

Given the nature of the collection process, there is also no reason why the approach couldn’t be implemented using something like Hadoop and MapReduce, to scale collection up even further.

For those interested in using the modified script, I’ve added a new class to TwitterScraper.py. As mentioned before, the new class is called TwitterSlicer, and it takes in “since” and “until” parameters, which should be datetime.datetime instances, as well as a #threads parameter to indicate how many concurrent threads you want collecting data. As it uses a ThreadPoolExecutor for parallel tasks, I’ve switched the master branch to use Python 3. However, I’ve also created a python2 branch, which contains the new code but does not have the parallel option at the moment.

Python 3: https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/master/TwitterScraper.py
Python 2: https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/python2/TwitterScraper.py
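
In use, it looks something like the following; note that the exact constructor arguments and method names here are assumptions based on the description above, so check TwitterScraper.py for the real interface:

import datetime
from TwitterScraper import TwitterSlicer  # master branch, Python 3

since = datetime.datetime(2016, 10, 1)
until = datetime.datetime(2016, 12, 1)
# Hypothetical signature: a "since" and "until" datetime, plus a thread count.
slicer = TwitterSlicer(since=since, until=until, n_threads=5)
slicer.search("Babylon 5")  # method name assumed for illustration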

Twitter Search Example in Python

This post is a response to a few comments. Firstly, as some people had requested it, I’ve rewritten the example from here in Python. I’ll be the first to admit my Python skills aren’t fantastic, but I tested it by collecting 5k tweets and it appeared to work.

The library can be found at https://github.com/tomkdickinson/Twitter-Search-API-Python. Similar to the Java version, it defines an abstract class you can extend, writing your own custom implementation of what to do with the Tweets returned on each call to Twitter.
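
As an illustration of that pattern, a subclass might look roughly like this; the class and method names are assumptions mirroring the Java version’s saveTweets hook, so check the repository for the actual interface:

from TwitterScraper import TwitterSearch  # assumed module and class names

class ListCollector(TwitterSearch):
    def __init__(self):
        super().__init__()
        self.tweets = []

    def save_tweets(self, tweets):
        # Called with each batch of tweets; return False to stop the search.
        self.tweets.extend(tweets)
        return True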


Scraping Tweets Directly from Twitter’s Search – Update

Sorry for my delayed response to this; I’ve seen several comments on the topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address it!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script. If you are just interested in grabbing the code, I’ve pushed the changes to https://github.com/tomkdickinson/TwitterSearchAPI, but feel free to read on if you want to know what’s been changed and why.

The first change is tiny. Originally, to get all tweets rather than “top tweets”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.
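
Putting that together with the search URL used by the script (see the constants further down), a request for all tweets now looks something like:

https://twitter.com/i/search/timeline?q=<your%20query>&f=tweets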

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we see a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The parameter here, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike scroll_cursor, which existed in the response to be extracted, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (2nd number set), we only need to extract this once, as we can keep it the same for all calls, just as Twitter does.

For the oldest tweet (1st number set), we need to extract the ID of the last tweet in our results each time, and use it to update our max_position value.

So, let’s take a look at some of the code I’ve changed:

String minTweet = null;
while ((response = executeSearch(url)) != null && continueSearch && !response.getTweets().isEmpty()) {
    // On the first call, remember the newest tweet ID; it stays fixed for all later calls.
    if (minTweet == null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    // The oldest tweet in this batch becomes the lower bound of the next request.
    String maxTweet = response.getTweets().get(response.getTweets().size() - 1).getId();
    if (!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Build the cursor in the format "TWEET-{oldestId}-{newestId}".
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if (query == null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        // "tweets" replaces the old "realtime" value, returning all tweets rather than top tweets.
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch (MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have one to begin with. On our first call, though, we take the first tweet in our response and, if minTweet is still null, set minTweet to its ID.

Next, we need to get maxTweet. As mentioned above, we get this by taking the last tweet in our results and using its ID. So that we don’t repeat results, we make sure that minTweet does not equal maxTweet, and if not, we construct our “max_position” value in the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM value from “scroll_cursor” to “max_position”. Normally I’d change the variable name as well, but for visual reference I’ve kept it the same for now, so you know where to change it.

Also, in constructURL, the TYPE_PARAM value has been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters returned in the JSON response.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;
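
For reference, the relevant part of the JSON response looks roughly like this (the values here are illustrative, not taken from a real response):

{
  "has_more_items": true,
  "items_html": "<li class=\"js-stream-item\">...</li>",
  "min_position": "TWEET-399159003478908931-606844263347945472",
  "refresh_cursor": "TWEET-...",
  "focused_refresh_interval": 30000
}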

Scraping Tweets Directly from Twitter’s Search Page – Part 1

EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your result. Rather than using scroll_cursor, it uses max_position. I’ve written about it in a bit more detail here.

EDIT 2 – A useful update to the Python version of this script, which allows larger datasets to be collected, can be found here.

In fairly recent news, Twitter has started indexing its entire history of Tweets, going all the way back to 2006. Hurrah for data scientists! However, even with this news (at the time of writing), their search API is still restricted to the past seven days of Tweets. While I doubt this will be the case permanently, as a useful exercise this post presents how we can search for Tweets on Twitter without necessarily using their API. Besides the indexing, there is also the advantage that Twitter is a little more liberal with rate limits, and you don’t require any authentication keys.

The post will be split into two parts: this first part looks at what we can extract from Twitter and how we might go about it, while the second will be a tutorial on how to implement this in Java.
