
Extracting a Larger Twitter Dataset

All credit for this method should go to https://github.com/simonlindgren who shared a script of his that motivated me to implement it into this work.

One of the issues with the current implementation of extracting tweets from Twitter's search API is that, at any point during the extraction process, Twitter can stop returning results, which breaks the chain of calls. This behaviour is unpredictable, and is most likely down to an automated feature in Twitter's backend.

An alternative is to slice your queries up into time intervals. Rather than performing one large search, you perform several much smaller searches, each of which includes a since and until date parameter so that it only extracts tweets for a given day.
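As a rough illustration of the slicing idea, the sketch below splits a date range into one-day intervals and builds one query string per interval, using the same since:/until: format that appears in the queries in this post. The helper names daily_slices and build_query are hypothetical and are not taken from TwitterScraper.py.

```python
from datetime import datetime, timedelta


def daily_slices(since, until):
    """Yield (start, end) pairs covering [since, until) one day at a time."""
    one_day = timedelta(days=1)
    start = since
    while start < until:
        # Clamp the final slice so it never runs past the requested end date.
        end = min(start + one_day, until)
        yield start, end
        start = end


def build_query(query, start, end):
    """Append since:/until: date operators to a search query (hypothetical helper)."""
    return "{} since:{} until:{}".format(
        query, start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")
    )


slices = list(daily_slices(datetime(2016, 10, 1), datetime(2016, 12, 1)))
queries = [build_query("Babylon 5", start, end) for start, end in slices]
print(len(queries))
print(queries[0])
```

Each resulting query can then be run as its own small search, so a broken chain of calls only affects a single day rather than the whole collection.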

This has two major advantages. Firstly, it helps to mitigate the issue where your search chain might be broken prematurely. Secondly, it allows you to collect data in parallel, vastly reducing the collection time.

To test both of these hypotheses, I’ve written a new class called TwitterSlicer, which takes in a “since” and an “until” date, as well as a #threads parameter to specify how many concurrent threads you want running.

For our first hypothesis, on the number of tweets collected, I’ve used the original method, TwitterSearch, to search for the query “Babylon 5”. To compare, I used the TwitterSlicer class, looking for tweets just between 2016-10-01 and 2016-12-01. Technically the advantage should be with the original approach using TwitterSearch, as it has no time limit. Table 1 shows the results.

Table 1
| Method | Query | Total Tweets |
| --- | --- | --- |
| TwitterSearch | Babylon 5 | 127 |
| TwitterSlicer | Babylon 5 | 3720 |

As can be seen, the original query only extracts 127 tweets, whereas the TwitterSlicer method extracts 3720.

To test our second hypothesis, I modify the query for TwitterSearch to search between the same date parameters as the TwitterSlicer method. Table 2 shows the results:

Table 2
| Method | Query | Threads | Total Tweets | Time Taken |
| --- | --- | --- | --- | --- |
| TwitterSearch | Babylon 5 since:2016-10-01 until:2016-12-01 | 1 | 3720 | 138 seconds |
| TwitterSlicer | Babylon 5 | 5 | 3720 | 45 seconds |

Interestingly, adding the date parameters has now fixed the issue with TwitterSearch, which has collected the same number of tweets as the TwitterSlicer method. However, the big difference here is that the TwitterSlicer approach was about 3 times faster (45 seconds against 138 seconds), highlighting the second advantage.

Given the nature of the collection process, there is also no reason why the approach couldn’t be implemented using something like Hadoop and MapReduce, to further reduce collection time.

For those interested in using the modified script, I’ve added a new class to TwitterScraper.py. As mentioned before, the new class is called TwitterSlicer, and takes in a “since” and an “until” parameter, which should be datetime.datetime objects, as well as a #threads parameter to indicate how many concurrent threads you want collecting data. As it uses a ThreadPoolExecutor for parallel tasks, I’ve switched the master branch to use Python 3. However, I’ve also created a python2 branch, which contains the new code but does not have the parallel option at the moment.
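To give a feel for how day slices and a ThreadPoolExecutor can fit together, here is a minimal, self-contained sketch. It is not the code from TwitterScraper.py: fetch_slice is a hypothetical stand-in that returns a fake record instead of calling Twitter, and the names collect and n_threads are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def fetch_slice(query, start, end):
    """Hypothetical stand-in for searching a single day.

    A real implementation would page through the results for
    "<query> since:<start> until:<end>"; this one returns a single fake record.
    """
    return [{
        "query": query,
        "since": start.strftime("%Y-%m-%d"),
        "until": end.strftime("%Y-%m-%d"),
    }]


def collect(query, since, until, n_threads=5):
    """Run one search per day between since and until on a pool of threads."""
    n_days = (until - since).days
    slices = [
        (since + timedelta(days=i), since + timedelta(days=i + 1))
        for i in range(n_days)
    ]
    tweets = []
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(fetch_slice, query, s, e) for s, e in slices]
        # Read results in submission order so the output stays chronological.
        for future in futures:
            tweets.extend(future.result())
    return tweets


results = collect("Babylon 5", datetime(2016, 10, 1), datetime(2016, 12, 1), n_threads=5)
print(len(results))
```

Because each day is an independent task, the number of threads only changes how quickly the slices are worked through, not which tweets are requested.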

Python 3, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/master/TwitterScraper.py
Python 2, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/python2/TwitterScraper.py

8 Comments

  1. kk

    I tried your scripts to crawl tweet data, but they return many duplicated tweets.

    My search query is “trump since:2016-08-01 until:2016-08-02”

    INFO:root:694 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:713 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:732 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:751 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:770 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:789 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:808 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:827 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:846 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:865 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:884 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …
    INFO:root:903 [2016-08-02 09:59:59] -29015670-bgmama58- Not enough! McCain is still supporting Trump the slanderer, supporter of Putin and has “Palin” knowledge! Scary! https://twitter.com/alivitali/status/760088322454814720 …

    I think there may be a problem in the perform_search method.

    • I’ve had a chance to take a quick look this weekend. Looks like they’ve changed some of their API backend, so I’ve put in a quick fix for now.

      For me, it still duplicates the first call, but then it should start collecting new tweets after that.

      I’ll take a look later to see if I can remove that as well.

      • kk

        Hi Tom,
        Thanks for your reply.
        You mentioned a change in the backend of the Twitter API. What was it?
        BTW, I used fake_useragent to generate multiple user agents to avoid being blocked.

        Thanks.

  2. Roger J

    Hi, thanks for your post. It was very helpful. Your and Simon Lindgren’s scripts worked great until today. It seems like something happened last night (maybe Twitter changed their search settings). Now I can only extract tweets from approximately the most recent two weeks (just like using the Twitter API). I am not sure whether you can take a look into this. Thank you so much!!

    • Hi Roger,

      I’ve not used the script much in recent months, so there’s always the chance Twitter has changed something in that time.

      I’ll give it a quick look over this weekend and see if anything’s changed.

      Tom

      • Roger J

        Thanks Tom. It seems like Twitter will stop me from extracting historic tweets using this script after I have done it multiple times. Using a different IP address seems OK to circumvent the ban at the moment. I am not sure if that is the case for you.

        • Tiago Santos

          Thanks for your post Tom, it helped me a lot. I was having the same problem, but I’ve added the cookies with cookielib to the headers of the request and it is working now.

  3. Malik

    Hey Tom,

    It’s 2017 and I’m trying to make simple requests as you have done in your library both with urllib and requests. However, Twitter doesn’t seem to return a JSON response at all whatsoever so I’m not even able to parse these desired fields. Do you have any ideas?

    Thanks,

    Malik
