
Combining WordNet and ConceptNet in Neo4j

WordNet and ConceptNet are two popular databases of words and concepts that are used in a number of AI applications. This post will look at how we can combine the two of them into one searchable graph using Neo4j.

Before I start, some people may view this as a redundant task because ConceptNet already ingests WordNet, so why not just stick with loading ConceptNet into Neo4j?

That is true, but WordNet models its relationships at the synset level, whereas ConceptNet seeks to undo this (https://github.com/commonsense/conceptnet5/wiki/FAQ). The aim of this exercise is to combine both types of abstraction into one large graph, allowing us to write graph queries against either network, or queries that combine both.

The rest of this post gives a quick overview of WordNet and ConceptNet, the graph model I’ve developed so far, a Python script that generates import files for Neo4j, and finally some example Cypher queries. For those just interested in the Python script, you can find it at https://github.com/tomkdickinson/wordnet_conceptnet_neo4j.

WordNet

WordNet is a lexical database for the English language and can be thought of as a combination of a thesaurus and a dictionary, with relationships between groups of words. There are two main types of entity in WordNet: the synset and the lemma.

A lemma is the canonical root form of a word. E.g. run, runs, ran, and running all have the same root ‘run’, and thus would be represented in WordNet as ‘run’.

Synsets are groups of lemmas that could be considered interchangeable. For example, if we look up ‘wedding’ as a noun (‘wedding.n’), the first synset contains the lemmas “wedding, wedding ceremony, nuptials, hymeneals”. There also exist a few synsets with wedding as a verb, the first of which contains the following lemmas: “marry, get married, wed, conjoin, hook up with, get hitched with, espouse”. There are other types of entities within WordNet, but for this brief description I won’t go into them.

In addition to storing synsets and lemmas, WordNet also contains relationships between Synsets. For example, hypernyms and hyponyms constitute an “IS_A” relationship between two synsets (hypernyms being parents, and hyponyms being children). E.g. Colour is a hypernym of purple, which in turn is a hyponym of colour.
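To make the synset/lemma/hypernym structure concrete, here is a minimal toy sketch in plain Python. It does not use WordNet itself: the Synset class, the ‘colour.n.01’ and ‘purple.n.01’ ids, and their lemma lists are hypothetical stand-ins, and only the wedding lemmas are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Toy stand-in for a WordNet synset: an id plus its member lemmas."""
    id: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # ids of parent synsets


# Hand-written entries for illustration only (not read from WordNet).
synsets = {
    "wedding.n.01": Synset(
        "wedding.n.01",
        ["wedding", "wedding ceremony", "nuptials", "hymeneals"],
    ),
    "colour.n.01": Synset("colour.n.01", ["colour"]),
    "purple.n.01": Synset("purple.n.01", ["purple"], hypernyms=["colour.n.01"]),
}


def hyponyms(synset_id):
    """Children of a synset, found by inverting the hypernym links."""
    return [s.id for s in synsets.values() if synset_id in s.hypernyms]
```

Storing only the hypernym side and deriving hyponyms from it reflects the point that the two are the same “IS_A” link seen from opposite ends.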

ConceptNet

ConceptNet deals specifically with concepts and assertions between concepts. For example, there is the concept of a “dog”, and the concept of a “kennel”. As humans, we know that a dog goes in a kennel. ConceptNet records that assertion as /c/en/dog /r/AtLocation /c/en/kennel.

A concept is identified by its URI, e.g. /c/en/dog, where “c” stands for concept, “en” is the language, and “dog” is the concept. In addition, you can sometimes get a concept like /c/en/dog/n, where “n” stands for noun.

Similarly, a relationship is denoted with /r/ followed by the relationship name. A list of common relationships can be found at: https://github.com/commonsense/conceptnet5/wiki/Relations
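As a rough sketch of how these URIs break apart, the following splits a concept URI into language, term, and optional part of speech, and strips the /r/ prefix from a relation URI. The function and class names are my own for illustration and are not taken from ConceptNet or from the script.

```python
from typing import NamedTuple, Optional


class Concept(NamedTuple):
    language: str
    term: str
    pos: Optional[str]  # e.g. "n" for noun; None when the URI has no POS part


def parse_concept_uri(uri: str) -> Concept:
    """Split a URI such as /c/en/dog or /c/en/dog/n into its parts."""
    parts = uri.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "c":
        raise ValueError(f"not a concept URI: {uri!r}")
    pos = parts[3] if len(parts) > 3 else None
    return Concept(language=parts[1], term=parts[2], pos=pos)


def parse_relation_uri(uri: str) -> str:
    """Return the relationship name from a URI such as /r/AtLocation."""
    parts = uri.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "r":
        raise ValueError(f"not a relation URI: {uri!r}")
    return "/".join(parts[1:])
```

For the assertion above, parse_concept_uri("/c/en/dog") gives language “en”, term “dog”, and no part of speech, while parse_relation_uri("/r/AtLocation") gives “AtLocation”.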

Our Graph Model

The current iteration of this model is pretty simple with just three label types: “Synset”, “Lemma”, “Concept”.

Synsets represent the grouping of Lemmas, and have relationships with other Synsets and with Lemmas.

Lemmas and Concepts can be the same node, and have relationships with other Concepts and with Synsets depending on how they are grouped.

Synset Properties

Property Name | Description
id | The synset’s ID in WordNet
pos | The part of speech associated with the synset
definition | The description (gloss) of the synset

Lemma Properties

Property Name | Description
name | Lemma name
pos | The part of speech inherited from its parent synset

Concept Properties

Property Name | Description
name | Lemma version of concept
pos | The part of speech for the concept (if it has one)
conceptUri | The URI of the concept (e.g. /c/en/wedding)

Relationship Properties

Property Name | Description
dataset | The dataset the assertion was generated from
weight | The weight of the assertion (set to a default of 2.0 for all WordNet relationships)

WordNet Relationship Types

WordNet Relationship | Graph Model Relationship
Hyponym | IsA
Hypernym | IsA
Member Holonym | PartOf
Substance Holonym | PartOf
Part Holonym | PartOf
Member Meronym | PartOf
Substance Meronym | PartOf
Part Meronym | PartOf
Topic Domain | Domain
Region Domain | Domain
Usage Domain | Domain
Attribute | Attribute
Entailment | Entailment
Causes | Causes
Also See | AlsoSee
Verb Group | VerbGroup
Similar To | SimilarTo
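The table above could be held in code as a simple lookup. In this sketch the relation keys are just the table’s names in snake_case, and the “reverse” flag is an assumption of mine: since hyponym/hypernym and holonym/meronym are inverse pairs that map to a single relationship type, one side of each pair presumably has to be flipped so that every IsA and PartOf edge points the same way. The post does not say how the script actually handles direction.

```python
# WordNet relation -> (graph relationship type, reverse the edge direction?)
# The reverse flags are an assumption made for this sketch.
WORDNET_TO_GRAPH = {
    "hyponym": ("IsA", True),
    "hypernym": ("IsA", False),
    "member_holonym": ("PartOf", False),
    "substance_holonym": ("PartOf", False),
    "part_holonym": ("PartOf", False),
    "member_meronym": ("PartOf", True),
    "substance_meronym": ("PartOf", True),
    "part_meronym": ("PartOf", True),
    "topic_domain": ("Domain", False),
    "region_domain": ("Domain", False),
    "usage_domain": ("Domain", False),
    "attribute": ("Attribute", False),
    "entailment": ("Entailment", False),
    "causes": ("Causes", False),
    "also_see": ("AlsoSee", False),
    "verb_group": ("VerbGroup", False),
    "similar_to": ("SimilarTo", False),
}


def to_edge(source, relation, target):
    """Turn 'source has <relation> target' into a (start, type, end) edge."""
    rel_type, reverse = WORDNET_TO_GRAPH[relation]
    start, end = (target, source) if reverse else (source, target)
    return (start, rel_type, end)
```

With this convention, “purple has hypernym colour” and “colour has hyponym purple” both produce the same edge, purple IsA colour.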

Python Script

As mentioned at the start of the post, this script can be found at: https://github.com/tomkdickinson/wordnet_conceptnet_neo4j.

To help merge these two databases, I’ve written a quick Python script that produces a few CSV files that can be imported into Neo4j using its import tool.

The first part of the script uses NLTK to extract nodes directly from WordNet through its Python interface. To run it, you’ll need to have run pip install nltk and downloaded the wordnet dataset.

The WordNet part starts by loading all synsets and creating a node for each, then adds the relationships between them. It then creates Lemma nodes for the lemmas contained in each synset.

The second part then extracts ConceptNet. You’ll need to download ConceptNet from https://github.com/commonsense/conceptnet5/wiki/Downloads. By default, the script expects the csv.gz file to be in the same directory, but you can modify this by changing the ‘concept_location’ parameter in the constructor. Also by default, the script will only include ‘en’ concepts, but you can change this with the ‘language_filter’ parameter, setting it to None if you want to include everything. A word of warning though: this script is not designed to be memory efficient. With just ‘en’ it will need about 3 GB of memory. If you supply None for the language parameter, you may hit issues.
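As a rough illustration of the language filter, here is a sketch that streams assertions out of a gzipped file one line at a time. It assumes each row has five tab-separated fields (edge URI, relation, start, end, and a JSON blob that holds the weight), so check that against the file you download. It also assumes a row is kept only when both ends are in the chosen language. The function name and the two sample rows are made up for the example, and this is not the script’s own code.

```python
import gzip
import io
import json


def iter_assertions(lines, language_filter="en"):
    """Yield (start, relation, end, weight) from tab-separated assertion rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip rows that do not match the assumed layout
        _edge_uri, relation, start, end, info = fields
        if language_filter is not None:
            prefix = f"/c/{language_filter}/"
            if not (start.startswith(prefix) and end.startswith(prefix)):
                continue
        weight = json.loads(info).get("weight", 1.0)
        yield start, relation, end, weight


# Two made-up rows, gzipped in memory so the example needs no download.
sample = (
    '/a/[/r/AtLocation/,/c/en/dog/,/c/en/kennel/]\t/r/AtLocation\t'
    '/c/en/dog\t/c/en/kennel\t{"weight": 1.0}\n'
    '/a/[/r/Synonym/,/c/fr/chien/,/c/en/dog/]\t/r/Synonym\t'
    '/c/fr/chien\t/c/en/dog\t{"weight": 1.0}\n'
)
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    english_only = list(iter_assertions(handle))
```

Because the function is a generator, the file itself is never held in memory; the memory cost described above would come from whatever is built up from the yielded rows.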

The script just loops through the downloaded csv.gz file, and will create concepts based on the assertions included in it, which are effectively triples with a start concept, an end concept, and a relationship between the two. For each concept, it will check if a lemma with that name and pos exists, and if it does, rather than adding a new lemma, it will append the “Concept” label to the lemma that already exists and add any additional ConceptNet relationships.
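The merge step described above might look something like the following sketch, which keys nodes on (name, pos) and adds a label rather than a second node when the key already exists. The dictionary layout and function names are hypothetical. It also assumes the concept term and the lemma name are spelled identically.

```python
# (name, pos) -> node dictionary; a hypothetical in-memory layout.
nodes = {}


def _get_or_create(name, pos):
    return nodes.setdefault(
        (name, pos), {"name": name, "pos": pos, "labels": set()}
    )


def add_lemma(name, pos):
    """Register a WordNet lemma node."""
    node = _get_or_create(name, pos)
    node["labels"].add("Lemma")
    return node


def add_concept(concept_uri):
    """Register a ConceptNet concept, reusing a matching lemma node if any."""
    parts = concept_uri.strip("/").split("/")
    name = parts[2]
    pos = parts[3] if len(parts) > 3 else None
    node = _get_or_create(name, pos)
    node["labels"].add("Concept")
    node["conceptUri"] = concept_uri
    return node


add_lemma("wedding", "n")
add_concept("/c/en/wedding/n")  # same name and pos: labels the existing node
add_concept("/c/en/kennel")     # no matching lemma: creates a new node
```

After these three calls there are two nodes: one carrying both the “Lemma” and “Concept” labels, and one carrying only “Concept”.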

After running, the output will be generated in a new directory called neo4j_csv_imports. There will be three files: relationships.csv, synsets.csv, and words.csv. These can then be imported into Neo4j using the neo4j-admin tool.
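For anyone unfamiliar with the import tool’s CSV format, here is a small sketch of what such files could look like. The :ID, :LABEL, :START_ID, :END_ID and :TYPE header fields are the import tool’s own conventions, with multiple labels separated by a semicolon. The column choice, node ids, and “wordnet” dataset value are assumptions for illustration, and the script’s real files may be laid out differently.

```python
import csv
import io


def write_nodes(handle, nodes):
    """Write node rows with an import-tool style header."""
    writer = csv.writer(handle)
    writer.writerow(["id:ID", "name", "pos", ":LABEL"])
    for node in nodes:
        labels = ";".join(sorted(node["labels"]))  # e.g. "Concept;Lemma"
        writer.writerow([node["id"], node["name"], node["pos"], labels])


def write_relationships(handle, edges):
    """Write relationship rows carrying the dataset and weight properties."""
    writer = csv.writer(handle)
    writer.writerow([":START_ID", ":END_ID", ":TYPE", "dataset", "weight:float"])
    for start, end, rel_type, dataset, weight in edges:
        writer.writerow([start, end, rel_type, dataset, weight])


# In-memory buffers stand in for words.csv and relationships.csv.
words_csv = io.StringIO()
write_nodes(
    words_csv,
    [{"id": "wedding/n", "name": "wedding", "pos": "n",
      "labels": {"Lemma", "Concept"}}],
)

relationships_csv = io.StringIO()
write_relationships(
    relationships_csv,
    [("purple.n.01", "colour.n.01", "IsA", "wordnet", 2.0)],
)
```

Writing a node’s labels into a single :LABEL column is what would let one row be both a Lemma and a Concept.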

The command will look a bit like:

./neo4j-admin import --database={NEO4J_LOCATION}/data/databases/wordnet_conceptnet.db \
--nodes={PATH_TO_DATASET_FOLDER}/synsets.csv \
--relationships={PATH_TO_DATASET_FOLDER}/relationships.csv \
--nodes={PATH_TO_DATASET_FOLDER}/words.csv

Example Queries

As a preface to this section, I’m by no means a Cypher expert, so these are almost certainly not the most efficient queries and there may be better ways of getting the same data.

To speed up our queries, it’s probably useful to create indexes on some of our nodes:

CREATE INDEX ON :Synset(id)
CREATE INDEX ON :Lemma(name, pos)
CREATE INDEX ON :Concept(conceptUri)

First, let’s find the synset for wedding:

MATCH (w:Synset {id: "wedding.n.01"}) RETURN w


This returns the first noun wedding synset.

We can also find the synset bride:

MATCH (b:Synset {id: "bride.n.01"}) RETURN b


Between these two, let’s find the shortest path using just Synsets:

MATCH
(w:Synset {id: "wedding.n.01"}), (b:Synset {id: "bride.n.01"}),p = shortestPath((w)-[*]-(b))
WHERE ALL ( n IN nodes(p) WHERE "Synset" IN labels(n))
RETURN p


This returns a path length of 9, which is quite long. However, now that we’ve added in ConceptNet, let’s try to find a path between the two synsets that can also pass through our Lemma/Concept nodes:

MATCH (w:Synset {id: "wedding.n.01"}), (b:Synset {id: "bride.n.01"}),p = shortestPath((w)-[*]-(b))
WHERE ALL ( n IN nodes(p) WHERE "Concept" IN labels(n) OR "Lemma" IN labels(n) OR "Synset" IN labels(n))
RETURN p

This path is now only of length 5, which is much shorter.
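To see why widening the label filter can shorten the path, here is a toy breadth-first search in Python that mimics what the two queries do: it only steps onto nodes that carry one of the allowed labels. The graph, node names, and resulting lengths are invented for the example and have nothing to do with the real 9 and 5 above.

```python
from collections import deque


def shortest_path_length(graph, labels, start, goal, allowed_labels):
    """BFS over an undirected graph, visiting only nodes with an allowed label.

    Returns the number of relationships on the shortest path, or None.
    """
    if not (labels[start] & allowed_labels and labels[goal] & allowed_labels):
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph.get(node, ()):
            if neighbour not in seen and labels[neighbour] & allowed_labels:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


# Hypothetical graph: a chain of synsets plus one shared Lemma/Concept node.
edges = [
    ("s:start", "s:mid1"), ("s:mid1", "s:mid2"), ("s:mid2", "s:end"),
    ("s:start", "w:shared"), ("w:shared", "s:end"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

labels = {
    "s:start": {"Synset"}, "s:mid1": {"Synset"},
    "s:mid2": {"Synset"}, "s:end": {"Synset"},
    "w:shared": {"Lemma", "Concept"},
}

synset_only = shortest_path_length(graph, labels, "s:start", "s:end", {"Synset"})
mixed = shortest_path_length(
    graph, labels, "s:start", "s:end", {"Synset", "Lemma", "Concept"}
)
```

Here synset_only is 3 and mixed is 2: the shared word node acts as a shortcut that the Synset-only search is not allowed to take.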

In addition to searching between synsets, we could also consider using ConceptNet within the same graph, searching for a path directly between just those two concepts.

MATCH (w:Concept {conceptUri: "/c/en/wedding/n"}), (b:Concept {conceptUri: "/c/en/bride/n"}),p = shortestPath((w)-[*]-(b))
WHERE ALL ( n IN nodes(p) WHERE "Concept" IN labels(n))
RETURN p

This now returns a path length of 3.

Rather than just searching for a single shortestPath, we could consider all the shortest paths between the two nodes whose relationships all have a weight greater than or equal to 1, to create a connective subgraph:

MATCH (w:Concept {conceptUri: "/c/en/wedding/n"}), (b:Concept {conceptUri: "/c/en/bride/n"}), p = allShortestPaths((w)-[*]-(b))
WHERE ALL ( n IN nodes(p) WHERE "Concept" IN labels(n)) AND ALL ( r IN relationships(p) WHERE r.weight >= 1)
RETURN p

Finally, we could remove the part of speech from our concept URIs, and consider just the path length between ‘/c/en/wedding’ and ‘/c/en/bride’.

MATCH (w:Concept {conceptUri: "/c/en/wedding"}), (b:Concept {conceptUri: "/c/en/bride"}),p = shortestPath((w)-[*]-(b))
RETURN p

Our path length is now down to 1.

Hopefully some of these queries help show the flexibility of combining the two graphs into one, and how you could start to look at creating cross-database queries of your own.
