getting term frequency, document frequency, tf-idf and other statistics from solr

March 28th, 2015


Apache Solr provides a set of very simple methods to access term frequency (TF), document frequency (DF), TF-IDF and other statistics for an indexed collection by using FunctionQueries. While FunctionQueries are documented online, it isn’t obvious how to use them, so here’s a short cheatsheet for Solr 4.

Query Formulation

Queries should be formulated as normal for Solr; for each result returned, the desired statistics will be part of the returned document. For instance, if you’re interested in the term frequency of a specific term, then each result will contain the frequency of that term in the relevant document. Similarly, if you’re interested in the number of documents in the index, then each result will contain the number of documents. Thus, when you’re interested in document-specific statistics you should make sure that the documents you’re interested in are matched by the query. Similarly, if you’re only interested in collection statistics, then it is sufficient for only one document in the collection to match your query.

Queries are in the form of:
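A reconstructed sketch of the general form (the host, port, collection name, and someFunction() are placeholders, not from the original post):

```
http://localhost:8983/solr/collection1/select?q=text:queryTerm&fl=id,someFunction()
```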

All documents that match queryTerm will be returned and, for each result, the fields specified in fl= will be returned. In the example above, we return the document id and the output of some functionQuery.

Term Frequency of “solr” in “text” field:
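A reconstructed example (host, port, and collection name are placeholders; termfreq() is the Solr 4 function query):

```
http://localhost:8983/solr/collection1/select?q=text:solr&fl=id,termfreq(text,'solr')
```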

This will return all documents that contain the term “solr”, as well as the frequency of the term “solr” in the “text” field.

Document Frequency of “solr” in “text” field
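A reconstructed example (host, port, and collection name are placeholders; docfreq() is the Solr 4 function query):

```
http://localhost:8983/solr/collection1/select?q=*:*&fl=docfreq(text,'solr')&rows=1
```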

We query with *:* since we don’t mind what matches; the document frequency is given by the docfreq() function query, and we limit the number of rows to 1 since the document frequency of a term is independent of the result set.

Inverse Document Frequency of “solr” in “text” field
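A reconstructed example (host, port, and collection name are placeholders; idf() is the Solr 4 function query):

```
http://localhost:8983/solr/collection1/select?q=*:*&fl=idf(text,'solr')&rows=1
```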

The same idea as docfreq.

There are many function queries supported by Solr as listed on the Wiki page for function queries. I personally use the DF function query to be able to quickly get document frequencies when I need to do TF-IDF weighting in other applications since all you need to do is index the documents and then let Solr do the rest.

document cosine similarity using lucene

June 4th, 2013

Calculating the cosine similarity between two documents is a common task in information retrieval and is used in a number of applications, such as ranking the similarity of documents to a search query. Since it’s such a common task, I designed a class that can easily be used to calculate the cosine similarity between any two documents using Lucene. In another post (at some time in the future) I’ll show how this method of calculating similarity can be incorporated into Lucene as a custom ranking function. This class is based on code from:

The Basics

  1. Get the code from GitHub here (you need both classes)
  2. Compile the code (depends on lucene-core, lucene-analyzers, and Apache commons-io) (details for compiling can be found here – different code but the same procedure)
  3. Run the main class: 
    java -cp $(echo lib/*.jar | tr ' ' ':') simseer.cosine.CosineSimilarity file1 file2

How it Works

The code pretty much speaks for itself, but basically what it does is use the Lucene analyzers to pre-process the text and then build a HashMap of all tokens that appear in either file. Recall that cosine similarity is based on the angle between two document vectors, and this HashMap basically represents all the terms in the vocabulary. The DocVector class represents the individual document vector for each document and is initialized with the HashMap representing the full vocabulary. Each entry in the individual document vectors is then updated with the weight of the token that represents that entry, the vectors are normalized, and the cosine similarity using the standard measure is calculated and returned.
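The computation described above can be sketched in a few lines (this is a simplification, not the actual simseer code; it takes raw term-frequency maps as input rather than running the Lucene analyzers):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A minimal sketch of the cosine similarity step: build the combined
// vocabulary of both documents, form term-frequency vectors over it,
// and compute the normalized dot product.
public class CosineSketch {

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // The combined vocabulary plays the role of the HashMap described above
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            double wa = a.getOrDefault(term, 0);
            double wb = b.getOrDefault(term, 0);
            dot += wa * wb;
            normA += wa * wa;
            normB += wb * wb;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("a", 2); d1.put("rose", 2); d1.put("is", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("a", 2); d2.put("rose", 2); d2.put("is", 1);
        // identical documents: cosine similarity is 1.0
        System.out.println(cosine(d1, d2));
    }
}
```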

More Information

Wikipedia Cosine Similarity
Read the code


jenkins hash in java

April 9th, 2013

I spent a few hours today implementing the Jenkins hash function [1][2] in Java. The implementation is based on a Python implementation of the hash, which in turn is based on the original C implementation.

The code uses BigInteger since Long is not always large enough – the original C code uses a long long.
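The actual code (the BigInteger port of the 64-bit hash) isn’t reproduced here, but the underlying trick of emulating C’s unsigned arithmetic in Java can be illustrated with the much simpler 32-bit Jenkins one-at-a-time function, which fits in a long as long as you mask after each step:

```java
// NOTE: this is NOT the implementation described in the post (that one
// ports the 64-bit Jenkins hash using BigInteger). It is the simpler
// 32-bit Jenkins one-at-a-time hash, shown only to illustrate emulating
// C's unsigned arithmetic in Java by masking a wider signed type.
public class JenkinsOneAtATime {

    public static long hash(byte[] key) {
        long hash = 0; // held in a long, masked to 32 bits after each addition
        for (byte b : key) {
            hash += (b & 0xff);
            hash += (hash << 10);
            hash &= 0xffffffffL;
            hash ^= (hash >>> 6);
        }
        hash += (hash << 3);
        hash &= 0xffffffffL;
        hash ^= (hash >>> 11);
        hash += (hash << 15);
        hash &= 0xffffffffL;
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(hash("hello".getBytes()));
    }
}
```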

I haven’t written any unit tests yet, but I tested it by hashing every word in 100 documents and the hash values were always exactly the same as those produced by the original C implementation. I’ll release some test cases when I get a chance to write some.

The code is available at GitHub


n-gram extraction using lucene

March 21st, 2013

Extracting n-grams from a piece of text is a common operation in any research that includes text analysis, such as information retrieval. I find myself needing to extract n-grams quite often and, when I do, I often use Lucene for the processing. Since this is a common operation, I wrote a class to simplify the process and that can easily be used for n-gram extraction. In this post, I will briefly describe how the class can be used for n-gram extraction. The reader interested in how the actual code works should just read the source – it’s pretty well documented and relatively simple to understand.

Using the n-gram extractor is relatively simple. Begin by initializing a new NGramExtractor object:

NGramExtractor extractor = new NGramExtractor();

Then, use the extractor to extract text:

extractor.extract("please extract n-grams, ok thanks, ok thanks", 2, true, true);

The extract method takes 4 arguments:

String text: the text that you want to extract n-grams from
int length: the length of the n-grams
Boolean stopWords: whether or not stop words should be included in the n-grams (this is useful for information retrieval)
Boolean overlap: whether or not the n-grams should overlap, i.e. “a rose is a rose”, length = 2. With overlap n-grams are {“a rose”, “rose is”, “is a”, “a rose”}. Without overlap n-grams are {“a rose”, “is a”}

The extract method does not actually return anything; to get the n-grams you can use one of the following methods:

LinkedList<String> ngrams = extractor.getNGrams();
LinkedList<String> uniqueNgrams = extractor.getUniqueNGrams();

The first method above returns a LinkedList of all extracted n-grams while the second only returns the unique ngrams, i.e. if an n-gram occurs more than once, it only returns the first occurrence. Furthermore, the LinkedLists preserve the order in which the n-grams occur in the text.

Lastly, you can get the frequency of any n-gram using the following method:

extractor.getNGramFrequency(String ngram)

To compile you need to link against the Lucene libraries, which can be downloaded from the Lucene website. Specifically, there are two jar files that you should be interested in: lucene-core.jar and lucene-analyzers.jar

To compile a driver class called NGramDriver.java on my Ubuntu Linux computer, I issue the following command (in the same directory where NGramExtractor.java is):

javac -cp lucene-core-3.6.1.jar:lucene-analyzers-3.6.1.jar:. NGramDriver.java

Pay special attention to the -cp (classpath), which tells java where to find the required Lucene jars. Running the program is then simply a case of:

java -cp lucene-core-3.6.1.jar:lucene-analyzers-3.6.1.jar:. NGramDriver

An example driver class is shown below:

import java.util.LinkedList;

public class NGramDriver {

    public static void main(String[] args) {
        try {
            NGramExtractor extractor = new NGramExtractor();
            extractor.extract("please extract n-grams, ok thanks, ok thanks", 2, true, true);
            LinkedList<String> ngrams = extractor.getNGrams();
            for (String s : ngrams) {
                System.out.println("Ngram '" + s + "' occurs " + extractor.getNGramFrequency(s) + " times");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The output from running this code is:

Ngram 'please extract' occurs 1 times
Ngram 'extract n' occurs 1 times
Ngram 'n grams' occurs 1 times
Ngram 'grams ok' occurs 1 times
Ngram 'ok thanks' occurs 2 times
Ngram 'thanks ok' occurs 1 times
Ngram 'ok thanks' occurs 2 times


The code is available on GitHub here and a driver is here. Lastly, JavaDocs for the NGramExtractor class are available here.

ubuntu global jam cape town

September 5th, 2011


Ubuntu Global Jam

The Ubuntu Global Jam took place around the world this weekend. The Ubuntu Global Jam is an international event in which volunteers from LoCos (Ubuntu Local Committees) all around the world get together to work on improving Ubuntu. The purpose of the event is to encourage people to get involved with Ubuntu on a large scale and learn about ways in which they can contribute to Ubuntu. For instance, people can learn about and work on bug fixing, translation, documentation, testing, packaging or anything else that interests them.

The Ubuntu-ZA South African LoCo team hosted a bug fixing event at the Yola offices in Cape Town as part of the Ubuntu Global Jam, with a focus on fixing broken packages in the upcoming Ubuntu Oneiric, set to be released in October this year.

Since many of us were from a scientific background we decided to focus on fixing broken scientific packages, though that didn’t really happen and we just ended up fixing things that sounded cool or that we were specifically interested in. For instance, given that I am a huge KDE fan, I tried to focus on fixing KDE packages, though I did fix a few others.

Bug Fixers

Fixing Bugs

We ended up being quite a small group, with Stefano Rivera (tumbleweed), a Debian Developer and Ubuntu Master of the Universe (MOTU), leading the group; Michiel Baird, who (I think) had a bit of experience fixing Ubuntu bugs; and Marco Gallotta and I, who had no experience fixing Ubuntu bugs. Stefano was great in walking us through fixing broken packages – from identifying them, to figuring out what was wrong with them, to fixing them and then contributing the fixes back to Ubuntu and, in some cases, Debian and upstream.

The great thing about the Global Jam is that it encourages everyone to get involved, regardless of their skill level. So even if you’ve never touched source code in your life, you can still help out by testing, writing documentation or translating the software. It was a great experience for me – I’ve always wanted to put my skills to use and get involved in helping to improve Ubuntu, but have never known where to start. The Cape Town Global Jam, under the lead of Stefano, was a great way to get stuck into it and has set the ball in motion for a lot more work to come!

Software Freedom Day

Get Involved!

On a similar note, remember that Software Freedom Day 2011 is less than two weeks away so be sure to get involved! The Cape Linux Users Group and the UCT Linux Enthusiasts Group will be hosting an event on Software Freedom Day at UCT, so check out the Facebook Event if you’re interested in being part of it, even if it’s just to pop in and say hi :)

first academic conference and presentation: jcdl 2010

June 30th, 2010

I recently got back from Australia where I presented my first published conference paper at an academic conference. The conference was the ACM/IEEE Joint Conference on Digital Libraries (JCDL) which ran concurrently with the International Conference on Asian Digital Libraries (ICADL). I had one paper accepted at JCDL and another accepted at ICADL. The conferences took place in Surfer’s Paradise, Gold Coast, Australia.

The paper at JCDL was called Translating Handwritten Bushman Texts and is available via ACM here or via my institutional repository here. The paper for ICADL was called A Visual Dictionary for an Extinct Language and is available via Springer here or via my institutional repository here.

It was a great experience! Not only did I get to see and hear what cutting edge research was being performed by leading researchers in the field of digital libraries, but I also got to meet and interact with many of them and form some new connections at various Universities around the world. I also got to put faces to many of the papers I have been reading for most of this year :)

The main things that I took away from the conference were:

  • Computer science does not always need to be technical, but can also be philosophical and have social implications.
  • Researchers are generally interested in what others have to say.
  • Researchers contextualise the research of others and fit it in with their research.
  • Computer scientists are not simply nerds, but also like to have fun (but everyone already knows that)!

Unfortunately, I didn’t really have time to explore the Gold Coast since I was only there for 4 days, but I did get to go to the Outback Spectacular show (sort of Australia’s Wild West) and got to see family who immigrated to Australia a few years ago, as well as a cousin who lives in the United States, whom I hadn’t seen in 7 years and who just happened to be in Australia at the same time as me.

I would like to thank my supervisor A/Prof. Hussein Suleman for assisting me in writing the articles, as well as my co-authors Sanvir Manilal and Lebogang Molwantoa. I would also like to thank the Department of Computer Science at the University of Cape Town for funding my trip, and lastly, the JCDL and ICADL reviewers who liked my papers and got them accepted into the conferences.


November 12th, 2009

I just wrote a KDE service menu wrapper around the Dropbox python CLI. For more information please go here.

On another note, if you haven’t tried Dropbox I advise that you give it a go by clicking here :)

Categories: dropbox, gpl, kde, linux, programming

mencoder webcam floating point exception solution

October 23rd, 2009

I’ve spent most of the afternoon trying to figure out how I could use MEncoder to record video and audio from my webcam and kept running into a “floating point exception” problem. In scouring the internet I realised that many people seem to be having the exact same problem. It turns out that the problem is due to a bug in MEncoder and is simply fixed by installing the latest version.

I use Kubuntu Jaunty, so the problem for me lay in the version of MEncoder that comes packaged with it. To fix the problem, all I did was follow these instructions for adding a third-party repository for newer builds of MPlayer/MEncoder and then install the latest MEncoder using apt-get.

sudo apt-get install mplayer mencoder

Thereafter I was able to capture video and audio from my webcam using the following command:

mencoder tv:// -tv driver=v4l2:width=320:height=240:device=/dev/video0:forceaudio:adevice=/dev/dsp -ovc lavc -oac mp3lame -lameopts cbr:br=64:mode=3 -o webcam.avi

Categories: (k)ubuntu, linux, open source, tech

boldproject: bold translator overview

October 2nd, 2009

A few weeks ago I wrote a post which introduced the BOLD Project. Well, a lot has happened since then and this post gives an overview of the translation system which I am building.

BOLD Translator Overview

The translator is split into three parts:

  1. The preprocessor
  2. The user input
  3. The matcher

The Preprocessor

The preprocessor is called as soon as an image is inserted into the repository. The preprocessor works by first segmenting the Bushman words in the dictionary. It does this by exploiting the known fact that every Bushman word on a page is underlined by a solid black line. Once the Bushman words have been segmented, specific features are extracted from them and these features are stored in inverted files. Once the features have been extracted, the original image, the segmented words and the inverted files are all stored in the repository.

The User Input

The user who is accessing the Bleek & Lloyd notebooks uses a tool to select a specific word on a page which then becomes known as the key. The same features which were extracted from each word in the preprocessor are extracted from the key. These features along with the key image will be used later for matching.

The Matcher

The matcher starts by taking the features belonging to the key and finding images with the same features in the inverted files. For each feature match, the score of the image which matched increases. At the end of all the feature comparisons, the images with the highest scores are returned. At this stage there may be some images with the same or similar scores, so to resolve this clash the matcher performs a more intensive comparison between the key and the images with the highest score. Based on the result of this comparison, the most likely match is returned.
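The voting step the matcher performs can be sketched as follows (a simplification; the feature and image identifiers, and the use of plain strings for both, are hypothetical):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A simplified sketch of the matcher's scoring step described above:
// each feature of the key looks up the images that share that feature
// in the inverted file, and each match increments that image's score.
public class MatcherSketch {

    public static Map<String, Integer> score(List<String> keyFeatures,
                                             Map<String, List<String>> invertedFile) {
        Map<String, Integer> scores = new HashMap<>();
        for (String feature : keyFeatures) {
            for (String image : invertedFile.getOrDefault(feature, Collections.emptyList())) {
                scores.merge(image, 1, Integer::sum); // one vote per shared feature
            }
        }
        return scores; // images with the highest scores go on to the intensive comparison
    }
}
```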

So that’s how the BOLD Translator works. Ultimately it is a framework, which means it is designed so that anyone can adapt it and make use of it by plugging their own algorithms into each of the specific parts. In the next day or two I will blog about the actual work that has been done on the system up to now, as well as show some of the results that the translator returns at this point.

boldproject: introducing the bold project

September 9th, 2009

The BOLD (Bushman On-Line Dictionary) Project is an honours project being worked on by myself and two colleagues, Sanvir Manilal and Lebogang Molwantoa, and is supervised by Dr. Suleman. Together we are creating an online visual dictionary based on about 40 000 scanned images of dictionary pages which form part of the Bleek & Lloyd Collection. The dictionary pages contain an English word and the Bushman translation(s) of the word. The goal of the project is to create a usable on-line visual dictionary which researchers around the world can make use of to find out more about Bushman culture and the Bushman language.

The project has been split into three separate parts:

Part 1: Archive Management – Lebogang Molwantoa

This part involves the setting up and building of the archive, including developing administrative tools for managing the repository.

Part 2: Searching and Browsing – Sanvir Manilal

This part involves the way in which end users interact with and make use of the dictionary.

Part 3: Image Based Translation – Kyle Williams

This part involves using the Bushman dictionary to translate Bushman words in the existing Bleek & Lloyd Collection on the fly.

I’ll make use of this blog to provide updates on how development is going, as well as to document techniques I develop – with the idea being that:

  1. I document them for my own use
  2. They’re out there for use by other people who may be working on similar projects.

Wish us luck!

A page from the bushman dictionary*

* This image is not available under the same CC license as the rest of this blog. For more information about this image please visit