30 ideas sort of related to NLP
Thursday, August 27, 2015

Over the past year or so, as I was trying to learn more about machine learning, one related topic I haven't gotten to is natural language processing (NLP). I've also had Matthew Russell's Mining the Social Web sitting unread on my bookshelf for a while. Even though it's a bit outdated at this point, with references to Google Buzz (it looks like there is an updated edition available, though), I think it will be good for picking up some NLP basics. It's been described as a successor to Collective Intelligence, which I thought was a fantastic book, so I've really been looking forward to having the time to finally get through it. This post is going to be notes of what I learn as I learn it.

- Even though lexical diversity (unique tokens / total tokens) and term frequency distributions are simple, they are still important and useful to start with (see the first sketch after this list)
- The Natural Language Toolkit (NLTK) is a popular Python module for NLP
- Microformats and HTML5's microdata are ways of decorating markup to expose structured information
- CouchDB can be used to build up indexes on data and perform frequency analysis through MapReduce operations
- Add Lucene to enable full-text searching of CouchDB documents
- I've known Redis as a key-value store or cache, but it's also known as a data structure server because it can contain lists, sets, hashes, etc.
- When analyzing a graph (like Twitter followers), a graph database can help by providing common operations like clique detection or breadth-first search
- Many visualization tools besides matplotlib and Graphviz are usable from Python, like Ubigraph, Protovis, and SIMILE Timeline
- Edit distance (aka Levenshtein distance) is a measure of how many changes it would take to convert one string into another (sketched after this list)
- n-gram similarity measures how many n-grams two samples have in common
- The Jaccard index measures the similarity of two sets: |A ∩ B| / |A ∪ B| (a sketch combining it with n-gram overlap appears after this list)
- Calculating the distance between every pair of samples is O(n²), which becomes intractable for large n (I think the book could have gone into more detail here and mentioned an alternative approach like the one I wrote about at Locality Sensitive Hashing), but k-means clustering at O(kn) per iteration can approximate the clusters well (see the sketch after this list)
- Two visualizations I recognized but didn't know by name: Dorling Cartograms and dendrograms
- New (to me) visualizations for trees: radial trees and sunbursts
- Natural language frequency analysis follows Zipf's Law (a power-law, long-tail distribution), meaning a word's frequency is inversely proportional to its rank in the frequency table (a quick check appears after this list)
- TF-IDF is one of the fundamental information retrieval techniques for finding relevant documents in a corpus (I wrote about it at tf-idf)
- A common way to find similar documents is cosine similarity between vectors of TF-IDF weights (sketched after this list)
- Document similarities can be visualized with arc and matrix diagrams
- Much information is gained when you can look at multiple tokens at a time, like bi-grams (2-grams)
- Collocations are sequences of words that occur together often
- Contingency tables are data structures for expressing frequencies associated with the terms of a bi-gram
- Dice's coefficient, likelihood ratio, chi-square, and Student's t-score, in addition to the Jaccard index, are all statistical measures that can be used for discovering collocations (see the NLTK example after this list)
- Stemming and lemmatization both reduce words to a common base form: stemming by heuristically chopping affixes, lemmatization by mapping each word to its dictionary form
- Stop-words are extremely common words ("the", "of", "a") that carry little signal and are usually filtered out
- A typical NLTK NLP pipeline is (sketched after this list):
  - end of sentence (EOS) detection
  - tokenization
  - part-of-speech tagging
  - chunking - assembling compound tokens for logical concepts
  - extraction - tagging chunks as named entities
- Selecting sentences that contain frequently occurring words appearing near each other is a basic way to summarize documents (a simplified sketch appears after this list)
- Extracting entities from documents can address some of the shortcomings of bag-of-words approaches like TF-IDF (such as homographs and inconsistent capitalization), which n-grams don't completely solve
- Use the F1 score to measure accuracy against manually tagged documents (see the last sketch after this list)
- Facebook's Open Graph Protocol enables you to turn any web page into an object in the social graph by injecting RDFa metadata into the page
- The semantic web, if realized through standards like RDF and OWL, would be a domain-agnostic way to enable machines to understand and use web information
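Below are some minimal sketches of the ideas above, in the order they appear; all sample text and numbers are made up for illustration. First, lexical diversity and a term frequency distribution with NLTK:

```python
import nltk

# nltk.download("punkt")  # first run: fetch the tokenizer models

text = "The quick brown fox jumps over the lazy dog and the quick cat"  # toy sample
tokens = nltk.word_tokenize(text.lower())

# Lexical diversity: unique tokens / total tokens
print(len(set(tokens)) / len(tokens))  # ~0.77

# Term frequency distribution
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))  # [('the', 3), ('quick', 2), ...]
```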
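Next, a plain-Python sketch of Levenshtein edit distance using the standard dynamic-programming recurrence:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```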
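The n-gram overlap and Jaccard index bullets combine naturally: treat each sample as a set of n-grams and take the Jaccard index of the two sets:

```python
def ngrams(tokens, n=2):
    """The set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "the quick brown fox".split()
s2 = "the quick red fox".split()
print(jaccard(ngrams(s1), ngrams(s2)))  # 1 shared bigram out of 5 total -> 0.2
```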
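A naive k-means sketch (the book just cites the O(kn) cost; this toy implementation and its sample points are mine):

```python
import random

def kmeans(points, k, iters=10):
    """Naive k-means on 2-D points. Each iteration costs O(k * n)
    distance computations, versus O(n^2) for all-pairs clustering."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans([(1, 1), (1.2, 0.9), (0.8, 1.1), (8, 8), (7.9, 8.2)], k=2))
```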
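A quick check of Zipf's Law using the Gutenberg corpus that ships with NLTK: if frequency is inversely proportional to rank, then rank times frequency should stay roughly constant:

```python
import nltk
from collections import Counter

# nltk.download("gutenberg")  # first run
words = [w.lower() for w in nltk.corpus.gutenberg.words("austen-emma.txt")
         if w.isalpha()]
ranked = Counter(words).most_common()

# Under Zipf's Law, rank * frequency should stay in the same ballpark
for rank in (1, 10, 100, 1000):
    word, freq = ranked[rank - 1]
    print(rank, word, freq, rank * freq)
```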
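A hand-rolled TF-IDF and cosine similarity sketch over a toy three-document corpus (real code would use a library; this just shows the arithmetic):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]  # toy corpus
corpus = [d.split() for d in docs]

def tf_idf(doc):
    """Weight each term by (term frequency) * log(N / document frequency)."""
    counts = Counter(doc)
    return {t: (n / len(doc)) * math.log(len(corpus) / sum(t in d for d in corpus))
            for t, n in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return dot / den if den else 0.0

vectors = [tf_idf(d) for d in corpus]
print(cosine(vectors[0], vectors[1]))  # docs 0 and 1 share several terms
```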
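Collocation discovery with NLTK's collocation finder, which has the association measures mentioned above built in (the Genesis corpus here is just a convenient sample):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("genesis")  # first run
words = nltk.corpus.genesis.words("english-web.txt")

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # drop bigrams seen fewer than 3 times

# Each association measure ranks candidate collocations a bit differently
print(finder.nbest(measures.likelihood_ratio, 5))
print(finder.nbest(measures.chi_sq, 5))
print(finder.nbest(measures.dice, 5))
print(finder.nbest(measures.student_t, 5))
```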
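The five pipeline stages map onto NLTK functions roughly like this (the sample sentence is made up):

```python
import nltk

# First run: nltk.download(...) for "punkt", "averaged_perceptron_tagger",
# "maxent_ne_chunker", and "words"
text = "Tim O'Reilly founded O'Reilly Media. It is based in Sebastopol, California."

sentences = nltk.sent_tokenize(text)                 # 1. EOS detection
tokens = [nltk.word_tokenize(s) for s in sentences]  # 2. tokenization
tagged = [nltk.pos_tag(t) for t in tokens]           # 3. POS tagging
trees = [nltk.ne_chunk(t) for t in tagged]           # 4-5. chunking and extraction

for tree in trees:
    # named entities come back as labeled subtrees ("PERSON", "GPE", ...)
    for subtree in tree.subtrees(lambda t: t.label() != "S"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```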
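A simplified sketch of the summarization idea: it scores sentences by how many of the document's most frequent words they contain, ignoring the "near each other" proximity requirement of the full approach:

```python
import nltk
from collections import Counter

# nltk.download("stopwords")  # first run (plus "punkt" for the tokenizers)
def summarize(text, n_sentences=2, top_n_words=10):
    """Keep the sentences containing the most of the document's
    frequent non-stop-words."""
    stop = set(nltk.corpus.stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    top = {w for w, _ in Counter(words).most_common(top_n_words)}
    sentences = nltk.sent_tokenize(text)
    return sorted(sentences,
                  key=lambda s: -sum(w.lower() in top
                                     for w in nltk.word_tokenize(s)))[:n_sentences]
```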
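Finally, the F1 score is the harmonic mean of precision and recall; the counts below are invented:

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # how many extracted entities were right
    recall = tp / (tp + fn)     # how many true entities were found
    return 2 * precision * recall / (precision + recall)

# e.g. 8 entities extracted correctly, 2 spurious, 4 missed
print(round(f1(8, 2, 4), 2))  # 0.73
```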