Apache Solr provides a set of very simple methods to access term frequency (TF), document frequency (DF), TF-IDF and other statistics for an indexed collection by using FunctionQueries. While FunctionQueries are documented online, it isn’t obvious how to use them, so here’s a short cheatsheet for Solr 4.
Formulate queries as you normally would for Solr; the requested statistics are then returned as part of each result document. For instance, if you’re interested in the term frequency of a specific term, each result will contain the frequency of that term in that document. Similarly, if you’re interested in the number of documents in the index, each result will contain that number. So for document-specific statistics, make sure the documents you care about are matched by the query. For collection-wide statistics, it is enough for a single document in the collection to match.
Queries are in the form of:
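As a minimal sketch of that form, the snippet below builds the request URL in Python. The host, port and /select path are assumptions for a local Solr 4 install, and queryTerm and functionQuery() are placeholders rather than real names.

```python
from urllib.parse import urlencode

# Assumed endpoint for a local Solr 4 instance; change host, port and core as needed.
SOLR_SELECT = "http://localhost:8983/solr/select"

def solr_url(q, fl, **extra):
    """Return a select URL for query `q` that asks for the fields in `fl`."""
    params = {"q": q, "fl": ",".join(fl), **extra}
    # Leave function-query punctuation unescaped so the URL stays readable.
    return SOLR_SELECT + "?" + urlencode(params, safe="*:,()'")

# queryTerm and functionQuery() are placeholders, not real names.
url = solr_url("queryTerm", ["id", "functionQuery()"])
print(url)
```

The q parameter selects which documents come back, and fl lists what is returned for each of them.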
All documents that match queryTerm are returned, each with the fields specified in fl=. In the example above, we return the document id and the output of some functionQuery.
Term Frequency of “solr” in “text” field:
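A sketch of such a request, built in Python and using Solr's termfreq(field,term) function query. The endpoint is an assumed local install, and the id field is assumed to be the unique key of the schema.

```python
from urllib.parse import urlencode

# Assumed endpoint for a local Solr 4 instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def termfreq_url(field, term):
    """URL that matches documents containing `term` and returns its per-document TF."""
    params = {
        "q": f"{field}:{term}",                    # only documents containing the term
        "fl": f"id,termfreq({field},'{term}')",    # document id plus the term frequency
    }
    return SOLR_SELECT + "?" + urlencode(params, safe="*:,()'")

url = termfreq_url("text", "solr")
print(url)
```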
This will return all documents that contain the term “solr”, as well as the frequency of the term “solr” in the “text” field of each document.
Document Frequency of “solr” in “text” field:
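A sketch of such a request, built in Python with Solr's docfreq(field,term) function query, the match-all query *:* and rows=1. The endpoint is an assumed local install.

```python
from urllib.parse import urlencode

# Assumed endpoint for a local Solr 4 instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def docfreq_url(field, term):
    """URL that returns the document frequency of `term` in `field`."""
    params = {
        "q": "*:*",                            # match anything; the value is the same for every hit
        "fl": f"docfreq({field},'{term}')",    # collection-level statistic
        "rows": 1,                             # one row is enough
    }
    return SOLR_SELECT + "?" + urlencode(params, safe="*:,()'")

url = docfreq_url("text", "solr")
print(url)
```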
We query with *:* (match all documents) since we don’t mind what matches. The document frequency is requested with the docfreq() function query, and we limit the number of rows to 1 because the document frequency of a term is the same for every result.
Inverse Document Frequency of “solr” in “text” field:
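A sketch of such a request, built in Python with Solr's idf(field,term) function query. The endpoint is an assumed local install, and the match-all query with rows=1 mirrors the approach used for document frequency.

```python
from urllib.parse import urlencode

# Assumed endpoint for a local Solr 4 instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def idf_url(field, term):
    """URL that returns the inverse document frequency of `term` in `field`."""
    params = {
        "q": "*:*",                        # any matching document will do
        "fl": f"idf({field},'{term}')",    # IDF as computed by Solr for this field
        "rows": 1,                         # collection statistic, so one row suffices
    }
    return SOLR_SELECT + "?" + urlencode(params, safe="*:,()'")

url = idf_url("text", "solr")
print(url)
```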
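The same idea as docfreq, using the idf() function query instead: the value is a collection statistic, so a single row is enough.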
Solr supports many more function queries, which are listed on the Wiki page for function queries. I personally use the DF function query to quickly get document frequencies when I need to do TF-IDF weighting in other applications: all you need to do is index the documents and let Solr do the rest.
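To illustrate the kind of weighting this feeds into, here is a small Python sketch that combines a term frequency, a document frequency and a collection size (for example from termfreq(), docfreq() and numdocs()). It uses raw TF times log(N / DF), which is only one common TF-IDF variant and not necessarily what Solr computes internally; the counts are made up.

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """Weight a term by raw TF times log(N / DF), one common TF-IDF variant."""
    if doc_freq == 0:
        return 0.0  # term does not occur anywhere in the collection
    return term_freq * math.log(num_docs / doc_freq)

# Made-up counts for illustration only.
weight = tf_idf(term_freq=3, doc_freq=10, num_docs=1000)
print(weight)
```

A term that appears in every document gets a weight of zero under this variant, since log(N / N) is 0.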