Document Cosine Similarity Using Lucene
Calculating the cosine similarity between two documents is a common task in information retrieval, used in a number of applications such as ranking documents by their similarity to a search query. Because it is so common, I designed a class that makes it easy to calculate the cosine similarity between any two documents using Lucene. In a future post I’ll show how this calculation can be incorporated into Lucene as a custom ranking function. This class is based on code from: http://sujitpal.blogspot.com/2011/10/computing-document-similarity-using.html
The Basics
- Get the code from GitHub here (you need both classes)
- Compile the code (it depends on lucene-core, lucene-analyzers, and Apache commons-io); details for compiling can be found here (different code, but the same procedure)
- Run the main class:
java -cp $(echo lib/*.jar | tr ' ' ':') simseer.cosine.CosineSimilarity file1 file2
How it Works
The code pretty much speaks for itself, but here is the outline. It uses the Lucene analyzers to pre-process the text and then builds a HashMap of every token that appears in the two files. Recall that cosine similarity is based on the angle between two document vectors; this HashMap represents all the terms in the vocabulary, which gives each term its own position in those vectors. The DocVector class represents the vector for a single document and is initialized with the HashMap representing the full vocabulary. Each entry in a document vector is then updated with the weight of the token that the entry represents, the vectors are normalized, and the cosine similarity is calculated using the standard measure (the dot product of the two vectors divided by the product of their lengths) and returned.
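To make the steps above concrete, here is a minimal sketch of the same idea in plain Java. It is not the code from the GitHub repository: the class name CosineSketch and its methods are hypothetical, a simple regex tokenizer stands in for the Lucene analyzers, and raw term counts are assumed as the token weights (the real class may weight terms differently).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the DocVector/CosineSimilarity classes from the post.
final class CosineSketch {

    // Stand-in for a Lucene analyzer (assumption): lowercase the text,
    // split on anything that is not a letter or digit, and count each token.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Build the shared vocabulary: every term from either document gets one vector position.
    static Map<String, Integer> vocabulary(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> index = new HashMap<>();
        for (String term : a.keySet()) {
            index.putIfAbsent(term, index.size());
        }
        for (String term : b.keySet()) {
            index.putIfAbsent(term, index.size());
        }
        return index;
    }

    // Fill a vector with term weights (raw counts here) and scale it to unit length.
    static double[] toUnitVector(Map<String, Integer> tf, Map<String, Integer> vocab) {
        double[] vector = new double[vocab.size()];
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> entry : tf.entrySet()) {
            double weight = entry.getValue();
            vector[vocab.get(entry.getKey())] = weight;
            sumOfSquares += weight * weight;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            for (int i = 0; i < vector.length; i++) {
                vector[i] /= norm;
            }
        }
        return vector;
    }

    // For unit vectors the cosine similarity is just the dot product.
    // An empty document yields an all-zero vector, so the result is 0.
    static double cosine(String textA, String textB) {
        Map<String, Integer> tfA = termFrequencies(textA);
        Map<String, Integer> tfB = termFrequencies(textB);
        Map<String, Integer> vocab = vocabulary(tfA, tfB);
        double[] a = toUnitVector(tfA, vocab);
        double[] b = toUnitVector(tfB, vocab);
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }
}
```

Calling CosineSketch.cosine(textOfFile1, textOfFile2) on two strings gives a value between 0 (no shared terms) and 1 (identical term distributions). Normalizing the vectors first is a design choice that reduces the final step to a plain dot product.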
More Information
Wikipedia Cosine Similarity
Read the code


















