Twitter Search Example in Python

This post is a response to a few comments. Firstly, as some people had requested it, I’ve rewritten the example from here in Python. I’ll be the first to admit my Python skills aren’t fantastic, but I tested it by collecting 5k tweets and it appeared to work.

The library can be found at https://github.com/tomkdickinson/Twitter-Search-API-Python. Similar to the Java one, it defines an abstract class you can extend from and write your own custom implementation for what you do with the Tweets on each call to Twitter.

An Intro to Gradle

As I use Gradle in a lot of my projects these days, I thought I would write a few posts about my experiences using it. This first introductory post won’t go into much detail, so if you already know how to use Gradle, even if it’s just the very basics, I would skip this post for now.

So why Gradle?

Well for me, there are two main reasons:

  1. It’s more adaptable than other build tools like Maven. It’s commonly referred to as the sweet spot between Maven and Ant, because it has the useful dependency and life-cycle features of Maven as well as the task-based options you get with Ant.
  2. Gradle build scripts are a lot easier to read than both Ant and Maven ones (at least in my opinion).

To get started with Gradle, you’ll need a copy of Gradle installed, which can be found here. The latest version will do for now (at the time of writing, the latest version was 2.6). Unzip it somewhere on your machine and add its bin directory to your path.

Add Gradle to Path on Windows

  1. Press “Windows key + R”
  2. Type “control /name microsoft.system” and hit enter
  3. Go to “Advanced System Settings”
  4. Go to “Environment Variables”
  5. Under “System Variables”, select “Path” and hit edit
  6. Append this to the end of the variable value: “;\path\to\gradle\bin”
  7. Exit everything you opened
  8. Open a command prompt and type “gradle -version” to check it’s working

Add Gradle to Path on Linux

  1. Open up a text editor.
  2. Open the file ~/.bashrc
  3. At the bottom, add: export PATH=$PATH:/path/to/gradle/bin
  4. Save and close
  5. Open up a terminal, and type “gradle -version” to check it’s working

With Gradle installed, all you now need is a build.gradle file, which defines your Gradle build.

Here’s an example that IntelliJ IDEA generates for you when you want a new Gradle Java project:

group 'uk.co.tomkdickinson.scraper'
version '1.0-SNAPSHOT'
 
apply plugin: 'java'
 
sourceCompatibility = 1.7
 
repositories {
    mavenCentral()
}
 
dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Let’s take a look at the different components that make up the build script:

group 'uk.co.tomkdickinson.scraper'
version '1.0-SNAPSHOT'

These are standard Gradle properties that indicate the group and version number of your application. You can find more details on these and other variables in section 13.2.1 of the Gradle documentation.

apply plugin: 'java'

Plugins in Gradle specify the default build logic for your project. Here we apply the Java plugin, which defines a basic Java project. This tells Gradle that the following locations contain your Java code:

Java classes

src/main/java/

Java Unit Tests

src/test/java/

When specifying plugins, there are additional properties you can add to your build.gradle file.

For example:

sourceCompatibility = 1.7

specifies the source compatibility that Gradle uses when compiling your Java classes, in this case Java version 1.7. More of these types of properties can be found in the Gradle documentation for the Java Plugin.

The last part of this build script is all about dependency management. If you are already familiar with dependency management, then you’ll be glad to know Gradle is compatible out of the box with both Maven and Ivy repositories. For those who are not, dependency management is a really useful way to automatically include other libraries in your project. In the context of Maven, you specify the groupId, artifactId, and version of a particular library you want to include in your project. When your project builds, it will then automatically fetch this library and make it available in your project.

Let’s take a look at our example:

repositories {
    mavenCentral()
}
 
dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Here we notice two sections: repositories, and dependencies.

The repositories section allows us to specify sites that host libraries (typically served by a repository manager such as Nexus). mavenCentral() is a built-in method in Gradle that specifies the Maven Central repository. If you want to include Maven artifacts that you have installed locally, you can include mavenLocal(), or if you want to add a different repository altogether, you can specify:

maven {
    url "http://yourrepository"
}

Dependencies are where you link your libraries. In our example, IDEA has automatically included junit. As it’s a library we only want when running our unit tests, we specify “testCompile”. This library then won’t be bundled with our final jar when built.

For any other dependency you want to include in your application, you specify:

    compile group: 'groupName', name: 'artifactName', version: 'versionNumber'

To test that the build file works, you can just run “gradle build” from the command line when you’re in the same directory as your build.gradle file.

That’s the very basics for using Gradle. I’ll follow up with some more useful functionality in the near future, rather than just constructing a skeleton build file.

Scraping Tweets Directly from Twitter’s Search – Update

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script. If you are just interested in grabbing the code, I’ve pushed the changes to https://github.com/tomkdickinson/TwitterSearchAPI, but feel free to read on if you want to know what’s been changed and why.

The first change is tiny. Originally, to get all tweets rather than “top tweet”, we used the type_param “f” to denote “realtime”. However, the value for this has changed to just “tweets”.

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we get a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The highlighted parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike scroll_cursor, which could be extracted from the response, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs: the first is the oldest tweet you’ve extracted so far, and the second is the most recently created one.

For our newest tweet (2nd number set), we only need to extract the ID once, as we can keep it the same for all calls, as Twitter does.

For the oldest tweet (1st number set), we need to extract the ID of the last tweet in our results on each call, and use it to update our max_position value.
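As a rough illustration of the format described above, here is a minimal sketch that joins the three components into a max_position value. The class and method names (MaxPositionSketch, buildMaxPosition) are hypothetical and not part of the library; the IDs in the demo are the ones from the example value above.

```java
// Hypothetical helper for illustration only; not part of the original library.
final class MaxPositionSketch {

    // Builds "TWEET-{oldestTweetId}-{newestTweetId}", the three components
    // of max_position that are actually needed.
    static String buildMaxPosition(String oldestTweetId, String newestTweetId) {
        if (oldestTweetId == null || newestTweetId == null) {
            throw new IllegalArgumentException("both tweet IDs are required");
        }
        return "TWEET-" + oldestTweetId + "-" + newestTweetId;
    }

    public static void main(String[] args) {
        // The newest ID is captured once from the first page and then kept fixed.
        String newestTweetId = "606844263347945472";
        // The oldest ID is the last tweet on the page that was just fetched.
        String oldestTweetId = "399159003478908931";
        System.out.println(buildMaxPosition(oldestTweetId, newestTweetId));
    }
}
```

On each subsequent page only the first ID would change, since the newest tweet ID stays fixed for the whole search.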

So, let’s take a look at some of the code I’ve changed:

// Despite its name, minTweet holds the ID of the newest tweet (first tweet of the first page).
String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
    if(minTweet==null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    // The last tweet on the current page is the oldest one seen so far.
    String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
    if(!minTweet.equals(maxTweet)) {
        try {
            // Pause between requests.
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Format: TWEET-{oldest tweet ID}-{newest tweet ID}
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if(query==null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder;
        uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch(MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have one to begin with. On our first call though, if minTweet is still null, we take the first tweet in our response and store its ID in minTweet.

Next, we need to get the maxTweet. As mentioned above, we get this by taking the last tweet in our results and returning its ID. So that we don’t repeat results, we need to make sure that minTweet does not equal the maxTweet ID. If they differ, we construct our “max_position” value with the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

In constructURL, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned in the JSON response.

All you need to do is replace the original class variables with these, and update the constructor and the getters and setters:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;
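
For reference, here is a minimal sketch of what the updated class could look like once the constructor and accessors are filled in. The class name TwitterResponseSketch, the accessor names and the constructor shape are my own assumptions rather than the library’s actual code, and the sketch leaves out any parsing of items_html into tweets. The field names are kept identical to the JSON keys, on the assumption that the response is mapped onto the class by field name.

```java
// Illustrative sketch only; the real TwitterResponse class may differ.
final class TwitterResponseSketch {

    // Field names deliberately mirror the keys returned in the JSON response.
    private boolean has_more_items;
    private String items_html;
    private String min_position;
    private String refresh_cursor;
    private long focused_refresh_interval;

    // No-argument constructor, which field-based JSON mappers commonly rely on.
    TwitterResponseSketch() {
    }

    TwitterResponseSketch(boolean hasMoreItems, String itemsHtml, String minPosition,
                          String refreshCursor, long focusedRefreshInterval) {
        this.has_more_items = hasMoreItems;
        this.items_html = itemsHtml;
        this.min_position = minPosition;
        this.refresh_cursor = refreshCursor;
        this.focused_refresh_interval = focusedRefreshInterval;
    }

    boolean hasMoreItems() { return has_more_items; }
    void setHasMoreItems(boolean hasMoreItems) { this.has_more_items = hasMoreItems; }

    String getItemsHtml() { return items_html; }
    void setItemsHtml(String itemsHtml) { this.items_html = itemsHtml; }

    String getMinPosition() { return min_position; }
    void setMinPosition(String minPosition) { this.min_position = minPosition; }

    String getRefreshCursor() { return refresh_cursor; }
    void setRefreshCursor(String refreshCursor) { this.refresh_cursor = refreshCursor; }

    long getFocusedRefreshInterval() { return focused_refresh_interval; }
    void setFocusedRefreshInterval(long interval) { this.focused_refresh_interval = interval; }

    public static void main(String[] args) {
        // Made-up values, purely to exercise the accessors.
        TwitterResponseSketch response =
                new TwitterResponseSketch(true, "<li></li>", "TWEET-1-2", "TWEET-2-2", 30000L);
        System.out.println(response.hasMoreItems() + " " + response.getMinPosition());
    }
}
```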