Combining WordNet and ConceptNet in Neo4j

WordNet and ConceptNet are two popular databases of words and concepts that are used in a number of AI applications. This post will look at how we can combine the two of them into one searchable graph using Neo4j.

Before I start, some people may view this as a redundant task because ConceptNet already ingests WordNet, so why not just stick with loading ConceptNet into Neo4j?

While this is true, WordNet models its relationships at the synset level, whereas ConceptNet deliberately avoids that abstraction and works with natural-language terms instead (https://github.com/commonsense/conceptnet5/wiki/FAQ). The aim of this exercise is to combine both levels of abstraction into one large graph, allowing us to write graph queries against either network on its own, or queries that combine the two.

The rest of this post will give a quick overview of WordNet and ConceptNet, the graph model I’ve developed so far, a Python script that generates some import files for Neo4j, and finally some example Cypher queries. For those just interested in the Python script, you can find it at https://github.com/tomkdickinson/wordnet_conceptnet_neo4j.
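To give a flavour of what the import script does, here is a minimal sketch of generating the WordNet side of the import files, assuming NLTK for the WordNet data. The node labels, property names and file layout below are purely illustrative; the actual script in the repository may well differ.

import csv
from nltk.corpus import wordnet as wn  # requires a one-off nltk.download('wordnet')

# Illustrative only: writes neo4j-admin-import style CSVs with a Synset node file,
# a Term node file, and an IN_SYNSET relationship file linking the two levels.
terms = set()
with open('synsets.csv', 'w', newline='') as synset_file, \
     open('in_synset.csv', 'w', newline='') as rel_file:
    synsets = csv.writer(synset_file)
    rels = csv.writer(rel_file)
    synsets.writerow(['synsetId:ID(Synset)', 'definition', ':LABEL'])
    rels.writerow([':START_ID(Term)', ':END_ID(Synset)', ':TYPE'])
    for synset in wn.all_synsets():
        synsets.writerow([synset.name(), synset.definition(), 'Synset'])
        for lemma in synset.lemma_names():
            # Key terms the way ConceptNet keys its English terms (e.g. "cat",
            # not "cat.n.01"), so ConceptNet edges can attach to the same nodes.
            term = lemma.lower()
            terms.add(term)
            rels.writerow([term, synset.name(), 'IN_SYNSET'])

with open('terms.csv', 'w', newline='') as term_file:
    writer = csv.writer(term_file)
    writer.writerow(['termId:ID(Term)', ':LABEL'])
    for term in sorted(terms):
        writer.writerow([term, 'Term'])

ConceptNet’s published assertion CSVs can be processed into Term-to-Term relationship files in much the same way, with the Term nodes acting as the join point between the two networks.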

A Python Wrapper for Mining Frequent Graphs Using ParSeMiS

As part of my PhD, I’ve been doing a lot of frequent graph mining, and thought I’d share some of the scripts I’ve written to mine frequent graphs using an existing library called ParSeMiS.

ParSeMiS is a Java library that implements the gSpan algorithm (as well as a few others) for detecting frequent subgraphs in graph datasets.

The library takes two main inputs:

  • An input file that contains a list of graphs
  • A minimum support s, either an integer > 0 or a percentage, where a subgraph is considered frequent if it occurs in at least s of the input graphs

As it’s a Java library, I’ve written a small wrapper in Python to interface with it.

The wrapper uses NetworkX to manage graphs, and takes as input a list of NetworkX graphs. Currently it only deals with directed graphs, but when I get a chance, I’ll add in undirected graph support as well.

If you haven’t used NetworkX before, the choice mainly comes down to its read/write functionality. This lets you create your graphs in a variety of different formats and have NetworkX load them, and as the wrapper returns a list of NetworkX graphs, you can write the results to disk (or handle them yourself).

At some point in the future, I may update it to use graph-tools instead, as graph-tools has much better support for things like detecting subgraph isomorphism, and handling larger graphs in general.

The library can be found at: https://github.com/tomkdickinson/parsemis_wrapper

I’ve included a small example that shows how the wrapper works.
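As a rough illustration of the kind of usage (the class and argument names below are my own shorthand, so check the repository’s example for the actual interface), you build some directed NetworkX graphs, hand them to the miner with a minimum support, and get the frequent subgraphs back as NetworkX graphs:

import networkx as nx

# Illustrative import: see the repository's example for the real module/class names.
from parsemis_wrapper import ParsemisMiner

# Build a few small directed, labelled graphs.
graphs = []
for edges in [[('a', 'b'), ('b', 'c')],
              [('a', 'b'), ('b', 'd')],
              [('a', 'b')]]:
    g = nx.DiGraph()
    g.add_edges_from(edges, label='follows')
    graphs.append(g)

# Mine subgraphs that occur in at least two of the three input graphs.
miner = ParsemisMiner(minimum_frequency=2)
frequent_subgraphs = miner.mine_graphs(graphs)

for subgraph in frequent_subgraphs:
    print(subgraph.edges(data=True))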

Twitter Search Example in Python

This post is in response to a few comments. Firstly, as some people had requested it, I’ve rewritten the example from here in Python. I’ll be the first to admit my Python skills aren’t fantastic, but I tested it by collecting 5k tweets and it appeared to work.

The library can be found at https://github.com/tomkdickinson/Twitter-Search-API-Python. Similar to the Java version, it defines an abstract class you can extend with your own implementation of what to do with the tweets returned by each call to Twitter.
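As a rough sketch of that usage (the class and method names here are illustrative rather than the repository’s exact interface), you subclass the searcher and override the method that receives each batch of tweets, returning False when you want the search to stop:

# Illustrative sketch only: the real class and method names live in the repository.
from twitter_search import TwitterSearch

class SaveToListSearch(TwitterSearch):
    """Collects tweets in memory and stops after 5,000 of them."""

    def __init__(self):
        super().__init__()
        self.collected = []

    def save_tweets(self, tweets):
        # Called with each batch of tweets; returning False ends the search.
        self.collected.extend(tweets)
        return len(self.collected) < 5000

searcher = SaveToListSearch()
searcher.search("babylon 5")
print(len(searcher.collected))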


Scraping Tweets Directly from Twitter's Search – Update

Sorry for my delayed response to this; I’ve seen several comments on the topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address it!

As with most web scraping, at some point the provider changes their source code and scrapers break. This is something Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script. If you are just interested in grabbing the code, I’ve pushed the changes to https://github.com/tomkdickinson/TwitterSearchAPI, but feel free to read on if you want to know what’s changed and why.

The first change is tiny. Originally, to get all tweets rather than just “top tweets”, we set the type_param “f” to “realtime”. However, the value for this has now changed to just “tweets”.

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we see a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

That parameter, “max_position”, looks very similar to the original scroll_cursor. However, unlike scroll_cursor, which existed in the response and could simply be extracted, we have to construct this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (the second set of numbers), we only need to extract it once, as we can keep it the same for all subsequent calls, just as Twitter does.

For the oldest tweet (the first set of numbers), we need to take the ID of the last tweet in our results on each call, and use it to update our max_position value.

So, let’s take a look at some of the code I’ve changed:

String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
    // On the first call, remember the newest tweet ID; it stays fixed for all later calls.
    if(minTweet==null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    // The last tweet in the batch is the oldest one we've seen so far.
    String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
    if(!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // max_position takes the form TWEET-{oldestId}-{newestId}.
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if(query==null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder;
        uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch(MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have minTweet. Initially this is set to null, as we don’t have a value to begin with. On our first call, we take the first tweet in the response and, if minTweet is still null, set it to that tweet’s ID.

Next, we need to get the maxTweet. As mentioned above, we get this by taking the last tweet in our results and using its ID. So that we don’t repeat results, we check that minTweet does not equal maxTweet, and if it doesn’t, we construct our “max_position” parameter in the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

In constructURL, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters returned in the JSON response.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;

Extracting a User's Twitter Timeline Above the 3.2k Limit

Those familiar with the Twitter API will know the annoyance of the 3,200 limit on the number of tweets you can extract from a user’s timeline. For most of us who don’t tweet that often this isn’t an issue, but sometimes we might find we need to extract more tweets.

Following on from my previous post (many months ago, I know, but I’ve been busy with PhD work), Twitter’s indexing upgrade actually gives us the opportunity to extract more tweets than this, provided the user’s Twitter feed is public.

Using the same methodology as previously stated, we can search all original tweets that a user has made using the query:

from:screenname

So for example, to extract all tweets I’ve ever created, you can use from:tomkdickinson.

However, the major restriction with this method is that it does not include retweets. Even adding the “include:retweets” parameter does not seem to change this.
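As a quick illustration, using the Python scraper and the illustrative SaveToListSearch sketch from the Python search post above (again, the real class and method names may differ from the repository), pulling a public timeline just means passing the from: query:

# Reusing the illustrative SaveToListSearch class from the earlier sketch.
searcher = SaveToListSearch()
searcher.search("from:tomkdickinson")
print("Collected %d tweets (retweets not included)" % len(searcher.collected))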

Scraping Tweets Directly from Twitter's Search Page – Part 2

UPDATE: The code outlined here won’t work directly with Twitter any more, as they have updated their source code. Thankfully it only takes a few small changes to get the script working again, which I’ve outlined here.

In the previous post we covered, effectively, the theory of how we can search and extract tweets from Twitter without having to use their API. This post deals with an example implementation in Java, with a git repository that you can improve on or use yourself. You can find the repository over at https://github.com/tomkdickinson/TwitterSearchAPI.


Scraping Tweets Directly from Twitter's Search Page – Part 1

EDIT – Since I wrote this post, Twitter has updated how you get the next page of tweets for your result. Rather than using scroll_cursor, it now uses max_position. I’ve written about this in a bit more detail here.

EDIT 2 – A useful update to the Python version of this script, which allows larger datasets to be collected, can be found here.

In fairly recent news, Twitter has started indexing its entire history of tweets going all the way back to 2006. Hurrah for data scientists! However, even with this news (at the time of writing), their search API is still restricted to the past seven days of tweets. While I doubt this will be the case permanently, as a useful exercise this post presents how we can search for tweets from Twitter without necessarily using their API. Besides the indexing, there is also the advantage that Twitter is a little more liberal with rate limits, and you don’t require any authentication keys.

The post is split into two parts: this first part looks at what we can extract from Twitter and how we might start to go about it, and the second is a tutorial on how we can implement this in Java.
