
Scraping Tweets Directly from Twitter's Search Page – Part 2

UPDATE: The code outlined here no longer works directly with Twitter, as they have updated their source code. Thankfully it only takes a few small changes to get this script working again, which I've outlined here.

In the previous post we covered the theory of how we can search and extract tweets from Twitter without having to use their API. This post deals with an example implementation in Java, with an example git repository that you can improve on or use yourself. You can find the repository over at https://github.com/tomkdickinson/TwitterSearchAPI.

First, let’s have a quick recap of what we learned in the previous post. We have a URL that we can use to search Twitter with:

https://twitter.com/i/search/timeline

This includes the following parameters:

Key            Value
q              URL-encoded query string
f              Type of query (omit for top results, or "realtime" for all)
scroll_cursor  Allows pagination through results; if omitted, the first page is returned
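
For example, a realtime search for "babylon 5" (the query we use later in this post), with the query URL-encoded, would look like:

https://twitter.com/i/search/timeline?q=babylon%205&f=realtime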

We also know that Twitter returns the following JSON response:

{ 
    has_more_items: boolean, 
    items_html: "...", 
    is_scrolling_request: boolean, 
    is_refresh_request: boolean, 
    scroll_cursor: "...", 
    refresh_cursor: "...", 
    focused_refresh_interval: int 
}

Finally, we know that we can extract the following information for each tweet:

Embedded Tweet Data
Selector Value
div.original-tweet[data-tweet-id] The ID of the tweet
div.original-tweet[data-screen-name] The author's Twitter handle
div.original-tweet[data-name] The name of the author
div.original-tweet[data-user-id] The user ID of the author
span._timestamp[data-time] Timestamp of the post
span._timestamp[data-time-ms] Timestamp of the post in ms
p.tweet-text Text of the tweet
span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] Number of retweets
span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] Number of favourites
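
To make those selectors a little more concrete, here is a heavily simplified, illustrative sketch of the sort of markup you'll find inside items_html. The values here are made up, and the real markup carries many more classes, elements and attributes:

<li class="js-stream-item" data-item-id="563005701209751552">
  <div class="tweet original-tweet" data-tweet-id="563005701209751552"
       data-user-id="12345" data-name="Example User" data-screen-name="example_user">
    <span class="_timestamp" data-time="1423101600" data-time-ms="1423101600000">Feb 4</span>
    <p class="tweet-text">An example tweet about Babylon 5</p>
    <span class="ProfileTweet-action--retweet">
      <span class="ProfileTweet-actionCount" data-tweet-stat-count="2"></span>
    </span>
    <span class="ProfileTweet-action--favorite">
      <span class="ProfileTweet-actionCount" data-tweet-stat-count="5"></span>
    </span>
  </div>
</li>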

Ok, recap done, let's consider some pseudocode to get us started. As the example is going to be in Java, the pseudocode will take on a Java-like syntax.

searchTwitter(String query, long rateDelay) {
  URL searchURL = createSearchURL(query)
  TwitterResponse twitterResponse
  String scrollCursor
  while ( (twitterResponse = executeSearch(searchURL)) != null && twitterResponse.has_more_items && twitterResponse.scroll_cursor != scrollCursor) {
    List tweets = extractTweets(twitterResponse.items_html)
    saveTweets(tweets)
    scrollCursor = twitterResponse.scroll_cursor
    searchURL = createSearchURL(query, scrollCursor)
    sleep(rateDelay)
  }
}

Firstly, we define a function called searchTwitter, where we pass a query value as a string, and a specified time to pause the thread between calls. We then pass this query to a function that creates our search URL. Then, in a while loop, we execute the search to return a TwitterResponse object that represents the JSON Twitter returns. Checking that the response is not null, that it has more items, and that we are not repeating the scroll cursor, we proceed to extract tweets from items_html, save them, and create our next search URL. Finally, we sleep the thread for however long we choose with rateDelay, so we are not bombarding Twitter with a stupid number of requests that could be viewed as a very crap DDoS.

Now that we’ve got an idea of what algorithm we’re going to use, let’s start coding.

I'm going to use Gradle as the build system, as we are going to use some additional dependencies to make things easier. You can download and set it up on your machine if you want, but I've also added a Gradle wrapper (gradlew) to the repository so you can run the project without installing Gradle. All you'll need is to make sure your JAVA_HOME environment variable is set up and pointing to wherever Java is located.

Let's take a look at the Gradle file.

apply plugin: 'java'
 
sourceCompatibility = 1.7
version = '1.0'
 
repositories {
  mavenCentral()
}
 
dependencies {
  compile 'org.apache.httpcomponents:httpclient:4.3.6'
  compile 'com.google.code.gson:gson:2.3'
  compile 'org.jsoup:jsoup:1.7.3'
  compile 'log4j:log4j:1.2.17'
 
  testCompile group: 'junit', name: 'junit', version: '4.11'
}

As this is a Java project, we've applied the java plugin. This gives us the standard directory structure we get with Gradle and Maven projects: src/main/java and src/test/java.
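
The classes we'll write over the rest of this post all live under the package uk.co.tomkdickinson.twitter.search, so the project layout ends up looking like this:

src/main/java/uk/co/tomkdickinson/twitter/search/
    Tweet.java
    TwitterResponse.java
    InvalidQueryException.java
    TwitterSearch.java
    TwitterSearchImpl.java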

In addition, there are several dependencies I've included to help make the task a little easier. HttpClient provides libraries that make it easier to construct URIs, Gson is a useful JSON processing library that will allow us to convert the response from Twitter into a Java object, and JSoup is an HTML parsing library that we can use to extract what we need from the items_html value that Twitter returns to us. Finally, I've included JUnit, although I won't go into unit testing in this example.

Let's start writing our code. Again, if you're not familiar with Gradle, the root for your packages should be src/main/java. If the folders are not already there, you can create them yourself, or have your IDE generate them; feel free to look at the example code if you're still unclear. First, let's define a simple Tweet bean to hold the data we extract:

package uk.co.tomkdickinson.twitter.search;
import java.util.Date;
 
public class Tweet {
 
    private String id;
    private String text;
    private String userId;
    private String userName;
    private String userScreenName;
    private Date createdAt;
    private int retweets;
    private int favourites;
 
    public Tweet() {
    }
 
    public Tweet(String id, String text, String userId, String userName, String userScreenName, Date createdAt, int retweets, int favourites) {
        this.id = id;
        this.text = text;
        this.userId = userId;
        this.userName = userName;
        this.userScreenName = userScreenName;
        this.createdAt = createdAt;
        this.retweets = retweets;
        this.favourites = favourites;
    }
 
    public String getId() {
        return id;
    }
 
    public void setId(String id) {
        this.id = id;
    }
 
    public String getText() {
        return text;
    }
 
    public void setText(String text) {
        this.text = text;
    }
 
    public String getUserId() {
        return userId;
    }
 
    public void setUserId(String userId) {
        this.userId = userId;
    }
 
    public String getUserName() {
        return userName;
    }
 
    public void setUserName(String userName) {
        this.userName = userName;
    }
 
    public String getUserScreenName() {
        return userScreenName;
    }
 
    public void setUserScreenName(String userScreenName) {
        this.userScreenName = userScreenName;
    }
 
    public Date getCreatedAt() {
        return createdAt;
    }
 
    public void setCreatedAt(Date createdAt) {
        this.createdAt = createdAt;
    }
 
    public int getRetweets() {
        return retweets;
    }
 
    public void setRetweets(int retweets) {
        this.retweets = retweets;
    }
 
    public int getFavourites() {
        return favourites;
    }
 
    public void setFavourites(int favourites) {
        this.favourites = favourites;
    }
}
Next, the TwitterResponse bean, which mirrors the JSON structure we saw earlier:

package uk.co.tomkdickinson.twitter.search;
 
import java.util.ArrayList;
import java.util.List;
 
public class TwitterResponse {
 
    private boolean has_more_items;
    private String items_html;
    private boolean is_scrolling_request;
    private boolean is_refresh_request;
    private String scroll_cursor;
    private String refresh_cursor;
    private long focused_refresh_interval;
 
    public TwitterResponse() {
    }
 
    public TwitterResponse(boolean has_more_items, String items_html, boolean is_scrolling_request, boolean is_refresh_request, String scroll_cursor, String refresh_cursor, long focused_refresh_interval) {
        this.has_more_items = has_more_items;
        this.items_html = items_html;
        this.is_scrolling_request = is_scrolling_request;
        this.is_refresh_request = is_refresh_request;
        this.scroll_cursor = scroll_cursor;
        this.refresh_cursor = refresh_cursor;
        this.focused_refresh_interval = focused_refresh_interval;
    }
 
    public boolean isHas_more_items() {
        return has_more_items;
    }
 
    public void setHas_more_items(boolean has_more_items) {
        this.has_more_items = has_more_items;
    }
 
    public String getItems_html() {
        return items_html;
    }
 
    public void setItems_html(String items_html) {
        this.items_html = items_html;
    }
 
    public boolean isIs_scrolling_request() {
        return is_scrolling_request;
    }
 
    public void setIs_scrolling_request(boolean is_scrolling_request) {
        this.is_scrolling_request = is_scrolling_request;
    }
 
    public boolean isIs_refresh_request() {
        return is_refresh_request;
    }
 
    public void setIs_refresh_request(boolean is_refresh_request) {
        this.is_refresh_request = is_refresh_request;
    }
 
    public String getScroll_cursor() {
        return scroll_cursor;
    }
 
    public void setScroll_cursor(String scroll_cursor) {
        this.scroll_cursor = scroll_cursor;
    }
 
    public String getRefresh_cursor() {
        return refresh_cursor;
    }
 
    public void setRefresh_cursor(String refresh_cursor) {
        this.refresh_cursor = refresh_cursor;
    }
 
    public long getFocused_refresh_interval() {
        return focused_refresh_interval;
    }
 
    public void setFocused_refresh_interval(long focused_refresh_interval) {
        this.focused_refresh_interval = focused_refresh_interval;
    } 
 
    public List<Tweet> getTweets() {
        return new ArrayList<Tweet>();
    }
}

You'll notice the additional method getTweets() in TwitterResponse. For now, it just returns an empty ArrayList, but we will revisit it later.
In addition to these bean classes, we also want to consider the edge case where someone searches for an empty or null string, or where the query contains characters not allowed in a URL. To handle this, we will also create a small exception class called InvalidQueryException.

package uk.co.tomkdickinson.twitter.search;
 
public class InvalidQueryException extends Exception {
 
    public InvalidQueryException(String query) {
        super("Query string '"+query+"' is invalid");
    }
}

Next, we need to create a TwitterSearch class and its basic structure. An important thing to consider here is that we want the code to be reusable, so in the example I have made the class abstract, with an abstract method called saveTweets. The nice thing about this is that it decouples the saving logic from the extraction logic. In other words, it will allow you to implement your own save solution without having to rewrite any of the TwitterSearch code. Additionally, you might note that I've specified that the saveTweets method returns a boolean. This allows anyone extending the class to provide their own exit condition, for example once a certain number of tweets have been extracted. By returning false, we can indicate in our code to stop extracting tweets from Twitter.

package uk.co.tomkdickinson.twitter.search;
 
import java.net.URL;
import java.util.List;
 
public abstract class TwitterSearch {
 
    public TwitterSearch() {
 
    }
 
    public abstract boolean saveTweets(List<Tweet> tweets);
 
    public void search(final String query, final long rateDelay) throws InvalidQueryException {
 
    }
 
    public static TwitterResponse executeSearch(final URL url) {
        return null;
    }
 
    public static URL constructURL(final String query, final String scrollCursor) throws InvalidQueryException {
        return null;
    }
}

Finally, let's also create a TwitterSearchImpl. This will be a small implementation of TwitterSearch so we can test our code as we go along.

package uk.co.tomkdickinson.twitter.search;
 
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
 
public class TwitterSearchImpl extends TwitterSearch {
 
    private final AtomicInteger counter = new AtomicInteger();
 
    @Override
    public boolean saveTweets(List<Tweet> tweets) {
        if(tweets!=null) {
            for (Tweet tweet : tweets) {
                System.out.println(counter.incrementAndGet() + " [" + tweet.getCreatedAt() + "] - " + tweet.getText());
                if (counter.get() >= 500) {
                    return false;
                }
            }
        }
        return true;
    }
 
    public static void main(String[] args) throws InvalidQueryException {
        TwitterSearch twitterSearch = new TwitterSearchImpl();
        twitterSearch.search("babylon 5", 2000);
    }
}

All this implementation does is print out each tweet's date and text, collecting up to a maximum of 500 tweets, at which point the program terminates.
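
Because saveTweets is the only abstract method, swapping in a different storage back-end is trivial. As a minimal sketch (my own illustration, not part of the example repository), here's an implementation that appends each batch of tweets to a CSV file instead of printing them:

package uk.co.tomkdickinson.twitter.search;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class CsvTwitterSearch extends TwitterSearch {

    private final String filename;

    public CsvTwitterSearch(String filename) {
        this.filename = filename;
    }

    @Override
    public boolean saveTweets(List<Tweet> tweets) {
        if (tweets == null) {
            return true;
        }
        // Append each batch to the file; a longer-running implementation would
        // keep a single writer open instead of reopening it per batch.
        try (PrintWriter out = new PrintWriter(new FileWriter(filename, true))) {
            for (Tweet tweet : tweets) {
                // Escape any double quotes in the tweet text for CSV
                String text = tweet.getText() == null ? "" : tweet.getText().replace("\"", "\"\"");
                out.println(tweet.getId() + "," + tweet.getCreatedAt() + ",\"" + text + "\"");
            }
        } catch (IOException e) {
            e.printStackTrace();
            return false; // stop searching if we can't write results
        }
        return true; // keep paginating
    }
}

You would run it exactly like the print version, e.g. new CsvTwitterSearch("tweets.csv").search("babylon 5", 2000);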

Now that we have the skeleton of our project set up, let's start implementing some of the functionality. Considering our pseudocode from earlier, let's start with the search method in TwitterSearch:

public void search(final String query, final long rateDelay) throws InvalidQueryException {
    TwitterResponse response;
    String scrollCursor = null;
    URL url = constructURL(query, scrollCursor);
    boolean continueSearch = true;
    while((response = executeSearch(url))!=null && response.isHas_more_items() && continueSearch) {
        continueSearch = saveTweets(response.getTweets());
        scrollCursor = response.getScroll_cursor();
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        url = constructURL(query, scrollCursor);
    }
}

As you can probably tell, that is most of our pseudocode implemented. Running it will have no effect yet, as we haven't implemented any of the actual steps, but it is a good start.

Let's implement some of our other methods, starting with constructURL.

public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "scroll_cursor";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String scrollCursor) throws InvalidQueryException {
    if (query == null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "realtime");
        if (scrollCursor != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, scrollCursor);
        }
        return uriBuilder.build().toURL();
    } catch (MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

First, we check whether the query is valid. If not, we throw the InvalidQueryException from earlier. Additionally, we may encounter a MalformedURLException or URISyntaxException, both caused by an invalid query string, so when caught we also throw a new InvalidQueryException. Next, using a URIBuilder, we build our URL from the constants we've defined, plus the query and scroll_cursor values we pass in. With our initial query we will have a null scroll cursor, so we check for that too. Finally, we build the URI and return it as a URL, so we can use it to open up an InputStream later on.
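
As a quick sanity check, calling constructURL with our test query and a null scroll cursor (in code that handles InvalidQueryException) should print a URL along these lines; depending on your HttpClient version the space may be encoded as + or %20:

URL url = TwitterSearch.constructURL("babylon 5", null);
System.out.println(url);
// e.g. https://twitter.com/i/search/timeline?q=babylon+5&f=realtime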

Let's implement our executeSearch function. This is where we actually call Twitter and parse its response.

public static TwitterResponse executeSearch(final URL url) {
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream()));
        Gson gson = new Gson();
        return gson.fromJson(reader, TwitterResponse.class);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return null;
}

This is a fairly simple method. All we're doing is opening a URLConnection for our Twitter query, then parsing the response with Gson, deserializing the JSON into a TwitterResponse object that we can use. As we've already implemented the logic for the scroll cursor, if we were to run this now, rather than the program terminating after a few seconds, it would keep running until there is no longer a valid response from Twitter. However, we haven't quite finished yet, as we have yet to extract any information from the tweets.

The TwitterResponse object currently holds all the Twitter data in its items_html variable, so what we now need to do is go back to TwitterResponse and add some code that lets us extract that data. If you remember, earlier we added a getTweets() method to TwitterResponse, but it returns an empty list. We're now going to fully implement that method so that when called, it builds up a list of tweets from the response's items_html.

To do this, we are going to be using JSoup, and we can even refer to some of those CSS queries that we noted earlier.

public List<Tweet> getTweets() {
    final List<Tweet> tweets = new ArrayList<>();
    Document doc = Jsoup.parse(items_html);
    for(Element el : doc.select("li.js-stream-item")) {
        String id = el.attr("data-item-id");
        String text = null;
        String userId = null;
        String userScreenName = null;
        String userName = null;
        Date createdAt = null;
        int retweets = 0;
        int favourites = 0;
        try {
            text = el.select("p.tweet-text").text();
        } catch (NullPointerException e) {
            e.printStackTrace();
        }
        try {
            userId = el.select("div.tweet").attr("data-user-id");
        } catch (NullPointerException e) {
            e.printStackTrace();
        }
        try {
            userName = el.select("div.tweet").attr("data-name");
        } catch (NullPointerException e) {
            e.printStackTrace();
        }
        try {
            userScreenName = el.select("div.tweet").attr("data-screen-name");
        } catch (NullPointerException e) {
            e.printStackTrace();
        }
        try {
            final String date = el.select("span._timestamp").attr("data-time-ms");
            if (date != null && !date.isEmpty()) {
                createdAt = new Date(Long.parseLong(date));
            }
        } catch (NullPointerException | NumberFormatException e) {
            e.printStackTrace();
        }
        try {
            retweets = Integer.parseInt(el.select("span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
                    .attr("data-tweet-stat-count"));
        } catch(NullPointerException | NumberFormatException e) {
            e.printStackTrace();
        }
        try {
            favourites = Integer.parseInt(el.select("span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
                    .attr("data-tweet-stat-count"));
        } catch (NullPointerException | NumberFormatException e) {
            e.printStackTrace();
        }
        Tweet tweet = new Tweet(
                id,
                text,
                userId,
                userName,
                userScreenName,
                createdAt,
                retweets,
                favourites
        );
        if (tweet.getId() != null) {
            tweets.add(tweet);
        }
    }
    return tweets;
}

Let's discuss what we're doing here. First, we create a JSoup document from the items_html variable. This allows us to select elements within the document using CSS selectors. Next, we go through each of the li elements that represent a tweet, and extract all the information we are interested in. As you can see, there are a number of catch statements here. We want to guard against edge cases where a particular data item might not be present (e.g. the user's real name), while at the same time not using an all-encompassing catch statement that would skip a tweet just because it is missing a single piece of information. (Strictly speaking, JSoup's select() and attr() return empty results rather than null, so it is the number parsing that does most of the real work here; an empty attribute would otherwise throw a NumberFormatException.) The only value we require to save the tweet is the tweet ID, as this allows us to fully extract information about the tweet later on if we want. Obviously, you can modify this section to your heart's content to meet your own rules.

Finally, let's re-run our program one last time. You should now see tweets being extracted and printed out. That's it. Job done, finished!

Obviously, there are many ways this code can be improved. For example, a more generic error-checking methodology could be implemented to check for missing attributes (or you could just use Groovy and its safe-navigation operator ?.). You could implement Runnable in the TwitterSearch class to allow multiple calls to Twitter with a thread pool, as sketched below (although, I stress, respect rate limits). You could even change TwitterResponse so it serializes the tweets as a list on creation, rather than extracting them from items_html each time you access them.
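
As a minimal sketch of the Runnable idea (the pool size and queries here are arbitrary, and again, be careful how hard you hit Twitter):

package uk.co.tomkdickinson.twitter.search;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSearchExample {

    public static void main(String[] args) {
        // Keep the pool small: every task polls Twitter repeatedly.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (final String query : new String[]{"babylon 5", "deep space 9"}) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        // Each task runs its own search with a 2 second delay between pages
                        new TwitterSearchImpl().search(query, 2000);
                    } catch (InvalidQueryException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}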

17 Comments

  1. Dmytro

    Great stuff, thanks for sharing! There are a few HTML-related typos in the code, e.g., "&gt;" instead of ">" and "&amp;&amp;" instead of "&&", but otherwise it works as is.

    One strange thing I’ve noticed is that this loop:

    while((response = executeSearch(url))!=null && response.isHas_more_items() && continueSearch) {
        ...
    }

    may exit sometimes, even though there are still more results available. I have fixed this (for my specific search query) by removing “response.isHas_more_items()”.

    • Ah yes, thanks for spotting those. I thought I’d gotten rid of all of them, but clearly not. I’ll edit the post later and see if I can fix it.

      As for the loop and has_more_items response you get back, good spot. I noticed that particular behaviour in the past prior to Twitter indexing all their tweets, but couldn’t replicate it after they’d changed their architecture. My guess is that parameter is possibly there for their AJAX timeline calls, so that once they receive that key, their JS stops calling for more results on their infinite scroll (even if there might be more results available). I might update the post with that assumption a bit later as well.

  2. Dmytro

    Tom, I've compared the volume of tweets I'm collecting using this approach with some "ground truth" values I've got from other sources, and it seems I'm getting only a sample (less than 5%) of all tweets. I wonder if you have also noticed this sampling problem? Thanks!

  3. Ahmad

    Hi Tom,
    This is indeed a very useful blog post. I am trying to apply the code you implemented to retrieve some old tweets that are distributed over many pages. The problem is that in the while loop, it just keeps repeating the tweets found on the first page. So in every loop iteration, I am getting the same 20 tweets that were found on the first page and retrieved by the first iteration. I have also noticed while debugging that the parameter "scroll_cursor" is always null after each search request. I am not sure if Twitter has changed the name of that parameter to something else. I have tried changing it to "cursor" and "next_cursor" (inspired by reading through the Twitter API docs), changing the corresponding field and method names in the TwitterResponse class for the serialization to work, but with no luck. Can you please look into this and help me in this regard? Many thanks in advance!

    • Roger Rubens

      Hi Tom,

      I'm also having the same problems as Ahmad.
      I have another question: can it search between two dates?

      Thank you

      • Hi Roger,

        Added a new post that fixes the issue that Ahmad had with the scroll_cursor.

        As for searching between two dates, you can use the query “since:2015-07-13 until:2015-08-01”.
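
        In terms of the code from this post, that's just:

        twitterSearch.search("babylon 5 since:2015-07-13 until:2015-08-01", 2000);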

        My advice for any query using this script is to use Twitter's Advanced Search first. If you select all the parameters you want in there and hit search, you'll get your query on the next page in the search dialogue. You can then just copy and paste that into your script to perform the same query.

        Tom

    • Hi Ahmad,

      Sorry for the month delay with this, been a bit busy. However I’ve added an update as Twitter have changed the scroll_cursor parameter to something else.

      Tom

  4. Roger Rubens

    Tom,

    After running it for a while, I get the following exception:

    java.io.FileNotFoundException: https://twitter.com/i/search/timeline?q=dilma&f=tweets&max_position=TWEET-627689071364710400-627910748598108160
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at TwitterSearch.executeSearch(TwitterSearch.java:74)
    at TwitterSearch.search(TwitterSearch.java:32)
    at TwitterSearchImpl.main(TwitterSearchImpl.java:25)
    java.lang.NullPointerException
    at TwitterSearch.executeSearch(TwitterSearch.java:81)
    at TwitterSearch.search(TwitterSearch.java:32)
    at TwitterSearchImpl.main(TwitterSearchImpl.java:25)

    Can I take this to mean that I've reached the end of the Twitter results, and consequently the end of the search?

    Roger

    • Hi Roger,

      I updated the code-base to fix some issues with Twitter, as detailed in this post. However, the exception you are seeing looks more like an issue connecting to Twitter, rather than the previous issues with Twitter changing its format. Are you trying to run the software behind a proxy server?

      Tom

      • Roger Rubens

        Hi Tom,

        I’m not using a proxy server.
        Are you not getting the same exception?
        In your TwitterSearchImpl, try to remove the limit of 500 tweets.
        I’m trying to get all possible tweets.

        Roger

        • Hi Roger,

          I managed to reproduce your error for a longer collection. Looks like a broken connection to Twitter causes the FileNotFoundException to be thrown by the BufferedReader.

          To fix this, I’ve updated the codebase so that when that happens, it’ll wait 5 seconds, then try again.

          Let me know if you run into any other issues.

          Tom

  5. I have a problem. It takes only the first 20 tweets; the parameter scroll_cursor is always null.

    • Is this using the Java version of the scraper, or the Python one? I remember the Python version had a similar issue, but I fixed it. I might not have added the changes to the Java one though.

  6. alison

    Hi Tom, I have a question: how can I find the selectors for other Twitter attributes, like the number of followers, who retweeted a tweet, etc.? I need more information from the page. Thanks a lot! Happy new year!

    • Some of this information is really difficult to obtain. My advice would be to use the Twitter API as much as you can to get things like number of followers and users. Any tweet IDs or user IDs you extract with the application can be used to get some extra information from the API. However, some information, like who has retweeted a tweet, is limited. In the past, I've written scrapers that can extract the last 25 users who have retweeted something, but that's the maximum I can extract. You have to take a look at which AJAX calls are made on Twitter, and write a scraper for them. For example, for the retweets, Twitter makes this call: "https://twitter.com/i/activity/retweeted_popup?id=683777665258618880", where the id is the id of the tweet. Given that, you can then parse the result using a mixture of Gson for the JSON and Jsoup for the HTML. I've been meaning to write a blog post on some of this stuff, so I'll try and get around to it some time over the next week or so.
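
      As a rough, untested sketch of that idea, something like the following could work. Be warned: the JSON key holding the rendered HTML ("htmlUsers") and the use of data-screen-name attributes are assumptions on my part, so check the actual response in your browser's developer tools before relying on them.

      import com.google.gson.JsonObject;
      import com.google.gson.JsonParser;
      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;

      import java.io.InputStreamReader;
      import java.net.URL;

      public class RetweetersExample {

          public static void main(String[] args) throws Exception {
              String tweetId = "683777665258618880";
              URL url = new URL("https://twitter.com/i/activity/retweeted_popup?id=" + tweetId);
              // Parse the JSON wrapper first...
              JsonObject json = new JsonParser()
                      .parse(new InputStreamReader(url.openConnection().getInputStream()))
                      .getAsJsonObject();
              // ...then parse the HTML payload with Jsoup.
              // Assumption: the rendered list of retweeters sits in an "htmlUsers" field.
              Document doc = Jsoup.parse(json.get("htmlUsers").getAsString());
              // Assumption: each user element carries a data-screen-name attribute.
              for (Element user : doc.select("[data-screen-name]")) {
                  System.out.println(user.attr("data-screen-name"));
              }
          }
      }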

  7. Chris

    Hi Tom,

    Thank you for the post! It works great when I just type in the word or hashtag I'm trying to search for, but since and until don't seem to be working for me (e.g. twitterSearch.search("burningman since:2015-08-30 until:2015-09-06", 2); doesn't return any results). Would love any pointers!

  8. dino

    Hi Tom. Thank you very much for this; I used the Python code to grab tweets from Twitter. I lament not being able to extract the same tweets through the API.

    I was wondering though, has anybody gotten into trouble (legal, IP blocking, etc) with Twitter by using this method? I know that this isn’t expressly allowed.

    Thanks again!
    Dino
