Extracting n-grams from a piece of text is a common operation in research that involves text analysis, such as information retrieval. I need to extract n-grams quite often and, when I do, I usually use Lucene for the processing. Since this is such a common operation, I wrote a class that simplifies n-gram extraction. In this post, I will briefly describe how the class can be used. Readers interested in how the code actually works should read the source; it is well documented and relatively simple to understand.
Using the n-gram extractor is straightforward. Begin by initializing a new NGramExtractor object:
NGramExtractor extractor = new NGramExtractor();
Then, use the extractor to extract n-grams from a piece of text:
extractor.extract("please extract n-grams, ok thanks, ok thanks", 2, true, true);
The extract method takes 4 arguments:
String text: the text that you want to extract n-grams from
int length: the length of the n-grams
Boolean stopWords: whether or not stop words should be included in the n-grams (being able to leave them out is useful for information retrieval)
Boolean overlap: whether or not the n-grams should overlap. For example, take the text “a rose is a rose” with length = 2. With overlap, the n-grams are {“a rose”, “rose is”, “is a”, “a rose”}. Without overlap, the n-grams are {“a rose”, “is a”}; the trailing “rose” cannot form a full n-gram and is dropped.
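To make the overlap argument concrete, here is a minimal standalone sketch of the two modes. It is not the NGramExtractor implementation: the class name NGramSketch and the method ngrams are hypothetical, and instead of a Lucene analyzer it uses a crude tokenizer (lowercase, split on anything that is not a letter or digit), which is only an assumption that happens to reproduce the examples in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; NOT the NGramExtractor source.
public class NGramSketch {

    // Returns word n-grams of the given length, in order of occurrence.
    static List<String> ngrams(String text, int length, boolean overlap) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be at least 1");
        }
        // Assumed tokenizer: lowercase, then split on runs of non-alphanumerics.
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        // Overlapping n-grams advance one token at a time;
        // non-overlapping n-grams advance a whole n-gram at a time.
        int step = overlap ? 1 : length;
        List<String> result = new ArrayList<>();
        for (int i = 0; i + length <= tokens.size(); i += step) {
            result.add(String.join(" ", tokens.subList(i, i + length)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("a rose is a rose", 2, true));  // [a rose, rose is, is a, a rose]
        System.out.println(ngrams("a rose is a rose", 2, false)); // [a rose, is a]
    }
}
```

The only difference between the two modes in this sketch is the step size of the loop, which is why the non-overlapping result drops any tokens left over at the end of the text.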
The extract method does not return anything. To get the n-grams, use one of the following methods:
LinkedList<String> ngrams = extractor.getNGrams();
LinkedList<String> uniqueNgrams = extractor.getUniqueNGrams();
The first method above returns a LinkedList of all extracted n-grams, while the second returns only the unique n-grams, i.e. if an n-gram occurs more than once, only its first occurrence is returned. Both LinkedLists preserve the order in which the n-grams occur in the text.
Lastly, you can get the frequency of any n-gram using the following method:
extractor.getNGramFrequency(String ngram)
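The relationship between the full list, the unique list, and the frequencies can be sketched with a LinkedHashMap, which keeps keys in first-insertion order. This is an illustration under my own assumptions, not the way NGramExtractor necessarily stores its data; the class name NGramCounts and the method countInOrder are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; NOT the NGramExtractor source.
public class NGramCounts {

    // Counts each n-gram; keys stay in order of first occurrence.
    static Map<String, Integer> countInOrder(List<String> ngrams) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String ngram : ngrams) {
            counts.merge(ngram, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
                "please extract", "extract n", "n grams", "grams ok",
                "ok thanks", "thanks ok", "ok thanks");
        Map<String, Integer> counts = countInOrder(all);
        // The key set plays the role of the unique n-grams, in text order.
        System.out.println(counts.keySet());
        // A lookup plays the role of a frequency query.
        System.out.println(counts.get("ok thanks")); // 2
    }
}
```

With this layout a single pass over the n-grams yields both the unique list and the frequencies.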
To compile, you need to link against the Lucene libraries, which can be downloaded from the Lucene website. Specifically, you need two jar files: lucene-core.jar and lucene-analyzers.jar.
To compile a driver class called NGramDriver.java on my Ubuntu Linux computer I issue the following command (in the same directory where NGramExtractor.java is):
javac -cp lucene-core-3.6.1.jar:lucene-analyzers-3.6.1.jar:. NGramDriver.java
Pay special attention to the -cp (classpath) option, which tells javac where to find the required Lucene jars; the trailing . adds the current directory, where NGramExtractor.java lives. Running the program then needs the same classpath:
java -cp lucene-core-3.6.1.jar:lucene-analyzers-3.6.1.jar:. NGramDriver
An example driver class is shown below:
import java.util.LinkedList;

public class NGramDriver {
    public static void main(String[] args) {
        try {
            NGramExtractor extractor = new NGramExtractor();
            extractor.extract("please extract n-grams, ok thanks, ok thanks", 2, true, true);
            LinkedList<String> ngrams = extractor.getNGrams();
            for (String s : ngrams) {
                System.out.println("Ngram '" + s + "' occurs " + extractor.getNGramFrequency(s) + " times");
            }
        }
        catch (Exception e) {
            System.err.println(e.toString());
        }
    }
}
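The output from running this code is shown below. Note that 'ok thanks' is printed twice, because the driver loops over getNGrams(), which returns every occurrence rather than only the unique n-grams: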
Ngram 'please extract' occurs 1 times
Ngram 'extract n' occurs 1 times
Ngram 'n grams' occurs 1 times
Ngram 'grams ok' occurs 1 times
Ngram 'ok thanks' occurs 2 times
Ngram 'thanks ok' occurs 1 times
Ngram 'ok thanks' occurs 2 times
The code is available on GitHub here and a driver is here. Lastly, JavaDocs for the NGramExtractor class are available here.