In order to find these lexically similar documents, you need to represent each document as a set of sentences or phrases. One technique for this is called shingling and a k-shingle is any substring with length k in the document. The value of k needs to be large enough that the probability of a shingle appearing in any given document is low. To represent shingles more concisely, use a hash function to hash the substrings to a number. Then, for example, each shingle is represented as four bytes instead of a k byte substring.
Even with four byte shingles the set for a document still takes up more space than the document itself. We need to replace these large sets with smaller representations called signatures. A signature is the result of several hundred minhash calculations. Consider a matrix where there is a row for each shingle and a column for every document sets. The matrix has a 1 for a particular row and column if that shingle appears in the set. Pick a permutation of rows and then the minhash value for a column is the number of the first row (in permuted order) with a 1 in that column.
In practice, storing the sparse matrix and permuting and sorting millions of rows is too time-consuming, but it can be simulated by random hash functions that map row numbers to as many buckets as there are rows (and ignoring the small number of collisions). Instead of picking n random permutations you pick n randomly chosen hash functions. If a column has a 1 for row r, then the corresponding signature matrix column is updated with the bucket numbers as long as they are less than the current values.
From the signature matrix we can estimate the Jaccard similarities of the underlying sets. Mind = blown.
It still might not be possible to efficiently check the similarity of all pairs of documents because there are just too many pairs. To find the most similar pairs or all pairs above a certain similarity threshold, we need to focus only on pairs that are likely to be similar. The LSH approach here is to hash items several times and hope that similar pairs will end up in the same bucket at least once. Any pair that hashed to the same bucket in any of the hashings is a candidate pair (even though some of them are false positives).
Divide the signature matrix into b bands where each band consists of r rows. Hash each vector (the portion of each column in a band) to a large number of buckets. Each band has its own buckets. The similarity threshold is an S-curve and is approximately:
Finally, examine the candidate pairs to determine if they are above the previously determined similarity threshold.
Besides the minhash family of functions, there are also other function families that apply to other distance measures like Hamming distance or edit distance. LSH works well for fingerprint matching and entity-resolution.
UPDATE:
Spark implements several LSH operations and has more info on how they can be used.
Resources:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
Resources:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf