Skip to content

Scraping Tweets Directly from Twitters Search – Update

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that effect this scraping script. If you are just interested in grabbing the code, I’ve pushed the changes to https://github.com/tomkdickinson/TwitterSearchAPI, but feel free to read on if you want to know what’s been changed and why.

The first change is tiny. Originally, to get all tweets rather than “top tweet”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.

Second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we get a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The highlighted parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike the scroll_cursor which existed in the response to be extracted, we have to create this one ourself.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet ID’s, representing the oldest to most recently created tweets you’ve extracted.

For our newest tweet (2nd number set), we only need to extract this once as we can keep it the same for all calls, as Twitter does.

The oldest tweet (1st number set), we need to extract the last tweet id in our results each time to change our max_position value.

So, lets take a look at some of the code I’ve changed:

String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
    if(minTweet==null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
    if(!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if(query==null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder;
        uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch(MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have one to begin with. On our first call though, we get the first tweet in our response, and set the ID to minTweet, if minTweet is still null.

Next, we need to get the maxTweet. As previously said before, we get this by getting the last tweet in our results, and returning that ID. So we don’t repeat results, we need to make sure that the minTweet does not equal the maxTweet ID, and if not, we construct our “max_position” query with the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

Also, in constructUrl, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;

7 Comments

  1. Kvebl Kvebl

    Tnx Tom. It works flawlessly!

  2. Reza Reza

    Hi,
    Could you by any chance be interested in doing the same thing with Python?

    I tried to do it myself, but no matter what query string with whichever max_position I send, I get the first 20 tweets (even if I only set the range to include, say, 4 tweets).

    • Hi Reza,

      Sure. If I get a chance over the weekend, I’ll see if I can get an example done in python as well. My python skills aren’t as good as my Java, but the same thing can be done using stuff like beautiful soup. I’ll let you know once I’ve got a sample done.

    • HI Reza,

      I had a bit of time today so I implemented it in Python.

      Blog post is here and the code repository is here.

      • Reza Reza

        Hi Tom,

        This is amazing!

        I found my way to this page on the train today trying to give an old problem another go. Once I read my way down to the comments, it took me a few seconds to realize that Reza in the comments is me! Why I wasn’t notified when you replied to my comment 10 months ago is beyond me, but man did you respond. I just checked the Python code and it worked like a charm. Thanks a lot.

        Best,

  3. Farah Farah

    Hi Tom,

    Thanks for your great code.

    I have just a problem with this code. I never can get the tweets before ten days ago, because the items-html in response is not filled, but in search page I can see those old tweets. you can test it with “shagerd” instead of “babylon 5”.

  4. Camilo Romero Camilo Romero

    Hello Tom,

    After to test your code and change search query name, I got a problem. A new pop-up message appear, saying Pyhton has stop working … Do you have any idea or suggestion how can I solve this problem?

Leave a Reply