Skip to content

Scraping Tweets Directly from Twitters Search Page – Part 1

EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your result. Rather than using scroll_cursor, it uses max_position. I’ve written a bit more in detail here.

EDIT 2 – A useful update to the python version of this script, that allows larger datasets to be collected can be found here.

In fairly recent news, Twitter has started indexing it’s entire history of Tweets going all the way back to 2006. Hurrah for data scientists! However, even with this news (at time of writing), their search API is still restricted to the past seven days of Tweets. While I doubt this will be the case permanently, as a useful exercise this post presents how we can search for Tweets from Twitter without necessarily using their API. Besides the indexing, there is also the advantage that Twitter is a little more liberal with rate limits, and you don’t require any authentication keys.

The post will be split up into two parts, this first part looking at what we can extract from Twitter and how we might start to go about it, and the second a tutorial on how we can implement this in Java.

Right, to begin, lets say we want to search Twitter for all tweets related to the query “Babylon 5”. You can access Twitters advanced search without being logged in: https://twitter.com/search-advanced

If we take a look at the URL that’s constructed when we perform the search we get:

https://twitter.com/search?q=Babylon%205&src=typd

As we can see, there are two query parameters, q (our query encoded) and src (assumed to be the source of the query, i.e. typed).  However, by default, Twitter returns top results, rather than all, so on the displayed page, if you click on All the URL changes to:

https://twitter.com/search?f=realtime&q=Babylon%205&src=typd

The difference here is the f=realtime parameter that appears to specify we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we’re only getting the first 25 Tweets back. If we scroll down though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that everytime we reach the bottom of the page, Twitter makes an AJAX call a URL similar to:

https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features=1&include_entities=1&last_note_ts=85&scroll_cursor=TWEET-553069642609344512-553159310448918528-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

On further inspection, we see that it is also a JSON response, which is very useful! Before we look at the response though, let’s have a look at that URL and some of it’s parameters.

First off, it’s slightly different to the default search URL. The path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src, from before, there are several additional ones. The most important and essential new one though is scroll_cursor. This is what Twitter uses to paginate the results. If you remove scroll_cursor from that URL, you end up with your first page of results again.

Now lets take a look now at the JSON response that Twitter provides:

{
has_more_items: boolean,
items_html: "...",
is_scrolling_request: boolean,
is_refresh_request: boolean,
scroll_cursor: "...",
refresh_cursor: "...",
focused_refresh_interval: int
}

Again, not all parameters for this post are important to take note of, but the ones that are include: has_more_items, items_html, and scroll_cursor.

has_more_items – This lets you know with a boolean value whether or not there are any more results after this query.

items_html – This is where all the tweets are which Twitter uses to append to the bottom of their timeline. It requires parsing, but there is a good amount of information in there to be extracted which we will look at in a minute.

scroll_cursor – A pagination value that allows us to extract the next page of results.

Remember our scroll_cursor parameter from earlier on? Well for each search request you make to twitter, the value of this key in the response provides you with the next set of tweets, allowing you to recursively call Twitter until either has_more_items is false, or your previous scroll_cursor equals the last scroll_cursor you had.

Now that we know how to access Twitters own search functionality, lets turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets are at. However, it comes in a block of HTML as Twitter injects that block at the bottom of the page each time that call is made. The HTML inside is a list of li elements, each element a Tweet. I won’t post the HTML for one here, as even one tweet has a lot of HTML in it, but if you want to look at it, copy the items_html value (omiting the quotes around the HTML content) and paste it into something like JSBeautifier to see the formatted results for yourself.

If we look over the HTML, aside from the tweets text, there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet id itself. If you check, it’s actually in the root li element. Now, we could stop here as with that ID, you can query Twitters official API, and if it’s a public Tweet, you can get all kinds of information. However, that’d defeat the purpose of not using the API, so lets see what we can extract from what we already have.

The table below shows various CSS selector queries that you can use to extract the information with.

Embedded Tweet Data
Selector Value
div.original-tweet[data-tweet-id] The authors twitter handle
div.original-tweet[data-name] The name of the author
div.original-tweet[data-user-id] The user ID of the author
span._timestamp[data-time] Timestamp of the post
span._timestamp[data-time-ms] Timestamp of the post in ms
p.tweet-text  Text of Tweet
span.ProfileTweet-action–retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] Number of Retweets
span.ProfileTweet-action–favorite > span.ProfileTweet-actionCount[data-tweet-stat-count]  Number of Favourites

That’s quite a sizeable amount of information in that HTML. From looking through, we can extract a bunch of stuff about the author, the time stamp of the tweet, the text, and number of retweets and favourites.

What have we learned here? Well, to summarize, we know how to construct a Twitter URL query, the response we get from said query, and the information we can extract from said response. The second part of this tutorial (to follow shortly) will introduce some code as to how we can implement the above.

10 Comments

  1. This piece of writing presents clear idea designed
    for the new users of blogging, that actually how
    to do blogging and site-building.

  2. I like the valuable info you supply for your articles.
    I’ll bookmark your blog and test again right here frequently.

    I am slightly certain I will be informed lots of new stuff proper here!

    Best of luck for the following!

  3. Karl Karl

    Hey Tom, an easy way to extract the author, time stamp etc. once you have the JSON is to just paste the JSON into: json-csv.com.

    It will produce a nicely formatted CSV file with each data piece in their own column.

  4. Puneet Puneet

    I am getting no scroll_cursor parameter in json response. Can you tell me how can I get the next page hit ?

    • Hi Puneet,

      Added a new post that deals with this, and updated the code base =)

      Tom

  5. Thanks a lot for the idea to scrape twitter search

  6. kishore kishore

    Thank a lot,
    when I tried “https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd”. It doesn’t give scroll_cursor. How I can get the next page content.

    • Hi,

      I need to amend this article with another more recent post – http://tomkdickinson.co.uk/2015/08/scraping-tweets-directly-from-twitters-search-update/

      Basically, there is no scroll_cursor anymore, and instead you have a max_position parameter where you use your oldest tweet ID to scroll to the next page. Check out the python project for a good example from line 50 onwards. You basically select your last (oldest) tweet from what you’ve collected, and construct the parameter as TWEET-oldest_tweet_id-earliest_tweet_id, so it’d look something like:

      max_position=TWEET-123456789-234567891

      Hope that answers your question.

  7. Hi Tom! Thanks so much for this analysis.

    I’m looking into trying to pinch this JSON data via JS, but obviously it throws a CORS error. D’oh!

    Seems like the only solution is server-side. (And, y’know, actually using the oAuth twitter keeps trying to shove down my throat).

    Thanks again!

Leave a Reply