Using Word2Vec from Clojure

When classifying or clustering data in a machine learning problem, the first step is to create a numeric representation of the data, usually called the feature vector. Datasets consisting of images or audio files are already in numeric form. If we have text data, we first have to convert words or characters into numbers.

For a number of years, the bag-of-words approach was used to create a feature vector. This approach requires a dictionary that contains all the words used in the dataset.

Assume that we have a dictionary consisting of the words {“the”, “sleepy”, “happy”, “cat”, “dog”}. If we encounter two sentences, “the sleepy cat” and “the happy dog”, we replace each word with its index in the dictionary, counting from zero. Thus “the sleepy cat” becomes “0,1,3”, and “the happy dog” becomes “0,2,4”.
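The index lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the Clojure wrapper: the names `word_to_index` and `encode` are made up for this example, and it assumes zero-based indices, under which “dog” sits at index 4.

```python
# Dictionary from the example; a word's position in the list is its index.
dictionary = ["the", "sleepy", "happy", "cat", "dog"]

# Map each word to its zero-based position for fast lookup.
word_to_index = {word: i for i, word in enumerate(dictionary)}


def encode(sentence):
    """Replace each whitespace-separated word with its dictionary index.

    Raises KeyError for a word that is not in the dictionary.
    """
    return [word_to_index[word] for word in sentence.lower().split()]


print(encode("the sleepy cat"))  # [0, 1, 3]
print(encode("the happy dog"))   # [0, 2, 4]
```

Note that the resulting numbers are arbitrary labels: “cat” (3) is no closer in meaning to “dog” (4) than it is to “the” (0).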

The problem with this approach is that it ignores similarities between words. For example, “cricket”, “batsman” and “bowler” are related terms from the game of cricket, but bag of words assigns each of them a different number, and those numbers do not convey that the words are related.

Word2Vec is a tool developed by Mikolov et al. that generates feature vectors from text data. These feature vectors encode the relationships that words share with each other. Given a word’s feature vector, we can find similar words by using a distance function to look for the vectors closest to it.
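To make the idea of a distance function concrete, here is a small Python sketch that uses cosine similarity, one common choice for comparing word vectors. The three-dimensional vectors below are invented purely for illustration (vectors from a trained Word2Vec model typically have many more dimensions), and `cosine_similarity` and `most_similar` are hypothetical helper names, not functions from any Word2Vec library.

```python
import math

# Hypothetical toy vectors, made up for this example; not output from a real model.
vectors = {
    "batsman": [0.9, 0.8, 0.1],
    "bowler": [0.8, 0.9, 0.2],
    "cat": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word's vector."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("batsman", vectors))  # bowler
```

With these toy values, “batsman” and “bowler” point in nearly the same direction, while “cat” points elsewhere, which is the kind of relationship the index-based encoding cannot express.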

Multiple implementations of Word2Vec are available. In this post we introduce a Clojure wrapper around a Java implementation of Word2Vec. The source and the documentation are available on GitHub.


This post was written by Kiran Karkera, Analytics Project Manager at BRIDGEi2i.