Extracting a Larger Twitter Dataset

All credit for this method should go to https://github.com/simonlindgren who shared a script of his that motivated me to implement it into this work.

One of the issues with the current implementation of extracting Tweets from Twitters search API, is at any point during the extraction process, Twitter can stop returning results, and the chain of calls is broken. This behaviour is unpredictable, and most likely down to an automated feature in Twitters backend.

An alternative to this is to slice your queries up into time intervals, so rather than perform one large search, you perform several much smaller searches, but your search criteria includes a since and until date parameter, that only extracts tweets for a given day.

This has two major advantages. Firstly, it helps to mitigate the issue where your search chain might be broken prematurely, and secondly it allows you to include parallel data collection, vastly reducing the collection time.

To test both these hypotheses, I’ve written a new class called TwitterSlicer, which takes in a “since”, and “until” date, as well as #threads parameter to specify how many concurrent threads you want running.

For our first hypothesis on #tweets collected, I’ve used the original method TwitterSearch to search for the query “Babylon 5”. To compare, I used the TwitterSlicer class, looking for tweets just between 2016-10-01, and 2016-12-01. Technically the advantage should be with the original approach using TwitterSearch, as it has no time limit. Tabel 1 shows the results.

Table 1
Method Query Total Tweets
TwitterSearch Babylon 5 127
TwitterSlicer Babylon 5 3720

As can be seen, the original query only extracts 127 tweets, where as the TwitterSlicer method extracts 3720.

To test our second hypothesis, I modify the query for TwitterSearch to search between the same date parameters as the TwitterSlicer method. Table 2 shows the results:

Table 2
Method Query Threads Total Tweets Time Taken
TwitterSearch Babylon 5 since:2016-10-01 until:2016-12-01 1 3720 138 seconds
TwitterSlicer Babylon 5 5 3720 45 seconds

Interestingly, adding the date parameters has now fixed the issue with TwitterSearch, and has colleced the same number of tweets as the TwitterSlicer method. However, the big difference here is the TwitterSlicer approach was about 3 times faster, highlighting the second advantage.

Given the nature of the collection process, there is also no reason why the approach couldn’t be implemented using something like Hadoop and Map Reduce, to further scale up collection time.

For those interested in using the modified script, I’ve added a new class to the TwitterScraper.py. As mentioned before, the new class is called TwitterSlicer, and takes in a “since” and “until” parameter, which should be a datetime.datetime class, as well as an #threads parameter to indicate how many concurrent threads you want collecting data. As it uses a ThreadPoolExecutor for parallel tasks, I’ve switched the master branch to use python3. However, I’ve also created a python2 branch, which contains the new code, but does not have the parallel option at the moment.

Python 3, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/master/TwitterScraper.py
Python 2, https://github.com/tomkdickinson/Twitter-Search-API-Python/blob/python2/TwitterScraper.py