30 ideas sort of related to NLP
Thursday, August 27, 2015

Over the past year or so, as I was trying to learn more about machine learning, one related topic I haven't gotten to is natural language processing (NLP). I've also had Matthew Russell's Mining the Social Web sitting unread on my bookshelf for a while. Even though it's a bit outdated at this point, with references to Google Buzz (it looks like there is an updated edition available, though), I think it will be good for picking up some NLP basics. It's been described as a successor to Collective Intelligence, which I thought was a fantastic book, so I've really been looking forward to having the time to finally get through it. This post is going to be notes of what I learn as I learn it.

- Even though lexical diversity (unique tokens / total tokens) and term frequency distributions are simple, they are still important and useful to start with (see the first sketch after this list)
- The Natural Language Toolkit (NLTK) is a popular Python module for NLP
- Microformats and HTML5's microdata are ways of decorating markup to expose structured information
- CouchDB can be used to build up indexes on data and perform frequency analysis through MapReduce operations
- Add Lucene to enable full-text searching of CouchDB documents
- I've known Redis as a key-value store or cache, but it's also known as a data structure server because it can contain lists, sets, hashes, etc.
- When analyzing a graph (like Twitter followers), a graph database can help by providing common operations like clique detection or breadth-first search
- Many visualization tools besides matplotlib and Graphviz are usable from Python, like Ubigraph, Protovis, and SIMILE Timeline
- Edit distance (aka Levenshtein distance) is a measure of how many changes it would take to convert one string into another (sketched after this list)
- n-gram similarity measures how many n-grams two samples have in common
- The Jaccard index measures the similarity of two sets: |A ∩ B| / |A ∪ B| (a sketch combining it with n-gram overlap appears after this list)
- Calculating the distance between every pair of samples is O(n²), which becomes intractable for large n (I think the book could have gone into more detail here and mentioned an alternative approach like the one I wrote about at Locality Sensitive Hashing), but k-means clustering at O(kn) per iteration can approximate the clusters well (see the sketch after this list)
- Two visualizations I recognized but didn't know by name: Dorling Cartograms and dendrograms
- New (to me) visualizations for trees: radial trees and sunbursts
- Natural language frequency analysis follows Zipf's Law (a power-law, long-tail distribution), meaning a word's frequency is inversely proportional to its rank in the frequency table (a quick check appears after this list)
- TF-IDF is one of the fundamental information retrieval techniques for finding relevant documents in a corpus (I wrote about it at tf-idf)
- A common way to find similar documents is cosine similarity between vectors of TF-IDF weights (sketched after this list)
- Document similarities can be visualized with arc and matrix diagrams
- Much information is gained when you can look at multiple tokens at a time, like bi-grams (2-grams)
- Collocations are sequences of words that occur together often
- Contingency tables are data structures for expressing frequencies associated with the terms of a bi-gram
- Dice's coefficient, likelihood ratio, chi-square, and Student's t-score, in addition to the Jaccard index, are all statistical measures that can be used for discovering collocations (see the NLTK example after this list)
- Stemming and lemmatization both reduce words to a common base form: stemming by heuristically chopping affixes, lemmatization by mapping each word to its dictionary form
- Stop-words are extremely common words ("the", "of", "a") that carry little signal and are usually filtered out
- A typical NLTK NLP pipeline is (sketched after this list):
  - end of sentence (EOS) detection
  - tokenization
  - part-of-speech tagging
  - chunking - assembling compound tokens for logical concepts
  - extraction - tagging chunks as named entities
- Selecting sentences that contain frequently occurring words appearing near each other is a basic way to summarize documents (a simplified sketch appears after this list)
- Extracting entities from documents can address some of the shortcomings of bag-of-words approaches like TF-IDF (such as homographs and inconsistent capitalization), which n-grams don't completely solve
- Use the F1 score to measure accuracy against manually tagged documents (see the last sketch after this list)
- Facebook's Open Graph Protocol enables you to turn any web page into an object in the social graph by injecting RDFa metadata into the page
- The semantic web, if realized through standards like RDF and OWL, would be a domain-agnostic way to enable machines to understand and use web information
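Below are some minimal sketches of the ideas above, in the order they appear; all sample text and numbers are made up for illustration. First, lexical diversity and a term frequency distribution with NLTK:

```python
import nltk

# nltk.download("punkt")  # first run: fetch the tokenizer models

text = "The quick brown fox jumps over the lazy dog and the quick cat"  # toy sample
tokens = nltk.word_tokenize(text.lower())

# Lexical diversity: unique tokens / total tokens
print(len(set(tokens)) / len(tokens))  # ~0.77

# Term frequency distribution
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))  # [('the', 3), ('quick', 2), ...]
```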
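Next, a plain-Python sketch of Levenshtein edit distance using the standard dynamic-programming recurrence:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```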
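The n-gram overlap and Jaccard index bullets combine naturally: treat each sample as a set of n-grams and take the Jaccard index of the two sets:

```python
def ngrams(tokens, n=2):
    """The set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "the quick brown fox".split()
s2 = "the quick red fox".split()
print(jaccard(ngrams(s1), ngrams(s2)))  # 1 shared bigram out of 5 total -> 0.2
```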
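A naive k-means sketch (the book just cites the O(kn) cost; this toy implementation and its sample points are mine):

```python
import random

def kmeans(points, k, iters=10):
    """Naive k-means on 2-D points. Each iteration costs O(k * n)
    distance computations, versus O(n^2) for all-pairs clustering."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans([(1, 1), (1.2, 0.9), (0.8, 1.1), (8, 8), (7.9, 8.2)], k=2))
```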
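A quick check of Zipf's Law using the Gutenberg corpus that ships with NLTK: if frequency is inversely proportional to rank, then rank times frequency should stay roughly constant:

```python
import nltk
from collections import Counter

# nltk.download("gutenberg")  # first run
words = [w.lower() for w in nltk.corpus.gutenberg.words("austen-emma.txt")
         if w.isalpha()]
ranked = Counter(words).most_common()

# Under Zipf's Law, rank * frequency should stay in the same ballpark
for rank in (1, 10, 100, 1000):
    word, freq = ranked[rank - 1]
    print(rank, word, freq, rank * freq)
```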
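A hand-rolled TF-IDF and cosine similarity sketch over a toy three-document corpus (real code would use a library; this just shows the arithmetic):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]  # toy corpus
corpus = [d.split() for d in docs]

def tf_idf(doc):
    """Weight each term by (term frequency) * log(N / document frequency)."""
    counts = Counter(doc)
    return {t: (n / len(doc)) * math.log(len(corpus) / sum(t in d for d in corpus))
            for t, n in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return dot / den if den else 0.0

vectors = [tf_idf(d) for d in corpus]
print(cosine(vectors[0], vectors[1]))  # docs 0 and 1 share several terms
```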
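Collocation discovery with NLTK's collocation finder, which has the association measures mentioned above built in (the Genesis corpus here is just a convenient sample):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("genesis")  # first run
words = nltk.corpus.genesis.words("english-web.txt")

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # drop bigrams seen fewer than 3 times

# Each association measure ranks candidate collocations a bit differently
print(finder.nbest(measures.likelihood_ratio, 5))
print(finder.nbest(measures.chi_sq, 5))
print(finder.nbest(measures.dice, 5))
print(finder.nbest(measures.student_t, 5))
```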
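The five pipeline stages map onto NLTK functions roughly like this (the sample sentence is made up):

```python
import nltk

# First run: nltk.download(...) for "punkt", "averaged_perceptron_tagger",
# "maxent_ne_chunker", and "words"
text = "Tim O'Reilly founded O'Reilly Media. It is based in Sebastopol, California."

sentences = nltk.sent_tokenize(text)                 # 1. EOS detection
tokens = [nltk.word_tokenize(s) for s in sentences]  # 2. tokenization
tagged = [nltk.pos_tag(t) for t in tokens]           # 3. POS tagging
trees = [nltk.ne_chunk(t) for t in tagged]           # 4-5. chunking and extraction

for tree in trees:
    # named entities come back as labeled subtrees ("PERSON", "GPE", ...)
    for subtree in tree.subtrees(lambda t: t.label() != "S"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```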
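A simplified sketch of the summarization idea: it scores sentences by how many of the document's most frequent words they contain, ignoring the "near each other" proximity requirement of the full approach:

```python
import nltk
from collections import Counter

# nltk.download("stopwords")  # first run (plus "punkt" for the tokenizers)
def summarize(text, n_sentences=2, top_n_words=10):
    """Keep the sentences containing the most of the document's
    frequent non-stop-words."""
    stop = set(nltk.corpus.stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    top = {w for w, _ in Counter(words).most_common(top_n_words)}
    sentences = nltk.sent_tokenize(text)
    return sorted(sentences,
                  key=lambda s: -sum(w.lower() in top
                                     for w in nltk.word_tokenize(s)))[:n_sentences]
```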
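Finally, the F1 score is the harmonic mean of precision and recall; the counts below are invented:

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # how many extracted entities were right
    recall = tp / (tp + fn)     # how many true entities were found
    return 2 * precision * recall / (precision + recall)

# e.g. 8 entities extracted correctly, 2 spurious, 4 missed
print(round(f1(8, 2, 4), 2))  # 0.73
```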