Friday, October 10, 2014

tf-idf

Term frequency-inverse document frequency (tf-idf) is a measure of word importance in a document within a corpus. Words with the highest tf-idf often best characterize the topic of the document.

Words appearing most frequently in the corpus are not the most important words, as might be expected; they are common words like stop-words. Rare words are actually the best indicators of importance, especially when they appear multiple times in a single document: a word like "the" occurs in nearly every document and says nothing about the topic, while a rare term that shows up repeatedly in one document says a lot.

tf-idf can be represented by the following equation (thanks Online LaTeX Equation Editor):

\[ \mathrm{tfidf}_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \times \log\frac{N}{n_i} \]
Term frequency is the frequency f_ij of word i in document j, divided (normalized) by the maximum frequency of any word k in document j. Inverse document frequency, which discounts words that are simply common everywhere, is the number of documents N divided by the number of documents n_i that word i appears in, scaled by taking the logarithm (the base of the log does not matter).
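
As a minimal sketch, here is how that formula might look in Python. The tf_idf function, the toy documents, and all variable names are just for illustration, not code from the course material:

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf per the formula above: max-normalized tf times log(N / n_i)."""
    N = len(docs)
    term_counts = [Counter(doc) for doc in docs]   # f_ij: raw counts of word i in document j
    doc_freq = Counter()                           # n_i: number of documents containing word i
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_counts:
        max_f = max(counts.values())               # max_k f_kj
        scores.append({word: (f / max_f) * math.log(N / doc_freq[word])
                       for word, f in counts.items()})
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "rare aardvark sightings reported".split(),
]
print(tf_idf(docs)[2])  # the rare words get the highest scores
```

Note that a word appearing in every document gets an idf of log(1) = 0, so it contributes nothing, which is exactly the stop-word behavior described above.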

This topic has come up in a couple of Coursera classes I have looked at (Web Intelligence and Big Data, and Mining Massive Datasets) in the context of a search engine. Basically, you view each document and the query (itself a short document) as a vector of tf-idf scores, then find the most similar documents using cosine similarity as a way to rank the search results. Inverted indexes allow much of the tf-idf score to be pre-computed.
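
Here is a rough sketch of that ranking idea using scikit-learn's TfidfVectorizer and cosine_similarity. (scikit-learn's default weighting uses raw term counts and a smoothed idf rather than the max-normalized formula above, but the ranking idea is the same; the documents and query are made up.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "tf-idf measures word importance in a document",
    "stop-words are common and carry little information",
    "cosine similarity ranks documents against a query",
]
query = "rank documents by similarity to a query"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # one tf-idf vector per document
query_vector = vectorizer.transform([query])   # the query is treated as a short document

# Rank documents by cosine similarity to the query vector.
similarities = cosine_similarity(query_vector, doc_vectors).ravel()
for i in similarities.argsort()[::-1]:
    print(f"{similarities[i]:.3f}  {docs[i]}")
```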

UPDATE:

scikit-learn has a tf-idf usage example at Clustering text documents using k-means.
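
To get a feel for what that example does, a stripped-down sketch (toy documents and arbitrary parameters of my own choosing, not the example's code) might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "search engines rank documents by relevance",
    "ranking uses tf-idf and cosine similarity",
    "cats and dogs are popular pets",
    "many people keep pet cats at home",
]

# Vectorize with tf-idf, then cluster the documents with k-means.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each document
```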