Document Similarity Analysis for Search & Big Data Applications
Whole document comparison techniques support a range of useful applications.
SIMILARITY VS. EXACT MATCHING
Exact duplicates are generally easy to detect, for example by comparing a hash of each document's content. Finding documents that are similar, or near-duplicates, requires more effort.
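As a minimal sketch of the exact-duplicate case, the following groups documents by a SHA-256 hash of their text. The function names and the sample documents are illustrative assumptions, not part of any particular product.

```python
import hashlib
from collections import defaultdict


def content_hash(text):
    """Return a SHA-256 hex digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(docs):
    """Group document ids whose text is byte-for-byte identical.

    `docs` maps a document id to its text (hypothetical input shape).
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[content_hash(text)].append(doc_id)
    # Only groups with more than one member are duplicates.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


docs = {
    "a.txt": "Quarterly report for Q1.",
    "b.txt": "Quarterly report for Q1.",
    "c.txt": "Quarterly report for Q1 (draft).",
}
duplicates = find_exact_duplicates(docs)
```

Note that `c.txt` is not reported: a single changed character produces a completely different hash, which is why near-duplicates need a different technique.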
Typically, the first step is to extract a document vector that represents the document as a whole. One option is a statistical approach, in which the vector is built from the statistically most important terms in the document. Term importance is usually weighted by how frequently each term occurs across the data set as a whole, so that terms that are relatively rare in the corpus are prioritized.
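One common way to realize this kind of weighting is TF-IDF (term frequency multiplied by inverse document frequency). The sketch below is an assumption about how such a vector might be built, not a description of a specific product's method; the tokenizer, the `top_k` cut-off, and the sample corpus are all illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vectors(corpus, top_k=5):
    """Build a sparse TF-IDF vector per document, keeping the top_k terms.

    `corpus` maps a document id to its text (hypothetical input shape).
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        weights = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            if tf * idf > 0:  # terms found in every document carry no signal
                weights[term] = tf * idf
        # Highest weight first; ties broken alphabetically for stable output.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors[doc_id] = dict(top)
    return vectors


corpus = {
    "a": "the patent covers a turbine blade",
    "b": "the patent covers a battery cell",
    "c": "the contract covers a lease",
}
vectors = document_vectors(corpus)
```

In this small corpus, words such as "the" and "covers" appear in every document and drop out of the vectors, while "turbine" outweighs "patent" because it occurs in fewer documents.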
Alternatively, vocabularies can be used to guide the formation of the document vector. This tends to be the best approach where the data set is focused on a specific topic. The narrowness of the subject matter makes it practical to build and maintain a suitable vocabulary.
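A vocabulary-guided vector can be sketched as a count of vocabulary concepts found in the text. The vocabulary below is entirely hypothetical and maps surface terms to concept labels; a real vocabulary would be curated for the subject area and would likely handle multi-word terms and stemming.

```python
import re
from collections import Counter

# Hypothetical vocabulary for an energy-focused data set:
# each known term maps to the concept it signals.
VOCABULARY = {
    "turbine": "wind_energy",
    "rotor": "wind_energy",
    "photovoltaic": "solar_energy",
    "solar": "solar_energy",
    "battery": "energy_storage",
}


def vocabulary_vector(text, vocabulary):
    """Count how often each vocabulary concept occurs in the text.

    Terms that are not in the vocabulary are ignored entirely.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(vocabulary[t] for t in tokens if t in vocabulary))


vector = vocabulary_vector("The rotor and turbine feed a battery.", VOCABULARY)
```

Because only vocabulary terms contribute, two documents that use different words for the same concept ("rotor" and "turbine" here) still end up with overlapping vectors.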
Once the document vector has been extracted and stored as metadata, similarity analysis applications work by comparing the vectors of different documents. A range of statistical measures can be used for this comparison.
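Cosine similarity is one widely used measure for comparing such vectors; it is shown here as an example, not as the only option. The sketch assumes each vector is a dictionary mapping a term to its weight.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors.

    Returns 1.0 for vectors pointing the same way, 0.0 for vectors
    with no terms in common, and 0.0 if either vector is empty.
    """
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity({"turbine": 2.0, "blade": 1.0},
                         {"turbine": 4.0, "blade": 2.0})
unrelated = cosine_similarity({"turbine": 1.0}, {"lease": 1.0})
```

Because the measure depends on direction and not on length, a long document and a short one with the same mix of terms score as highly similar.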
SIMILARITY ANALYSIS APPLICATIONS
These techniques are useful in both search and business insight / big data settings. For example:
- Near-duplicate detection to improve search results quality
- Human Resources applications, such as automated CV to job description matching, or finding similar employees
- Patent research, through matching potential patent applications against a corpus of existing patent grants. See CPA Global Case Study
- Document clustering and auto-categorization using seed documents
- Security scrubbing - finding documents with very similar content, but with different access control lists
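To illustrate the security scrubbing item, the sketch below flags pairs of documents whose vectors are very similar but whose access control lists differ. The input shape, the 0.9 threshold, and the sample data are assumptions made for the example; the pairwise loop is quadratic in the number of documents, so a large corpus would need an indexing or blocking strategy that is not shown here.

```python
import math
from itertools import combinations


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def acl_mismatches(docs, threshold=0.9):
    """Return (id_a, id_b, similarity) for similar documents with different ACLs.

    `docs` maps a document id to a (vector, acl) pair, where `acl` is a
    frozenset of group names (hypothetical input shape).
    """
    flagged = []
    for id_a, id_b in combinations(sorted(docs), 2):
        vec_a, acl_a = docs[id_a]
        vec_b, acl_b = docs[id_b]
        if acl_a == acl_b:
            continue  # same permissions, nothing to review
        similarity = cosine(vec_a, vec_b)
        if similarity >= threshold:
            flagged.append((id_a, id_b, similarity))
    return flagged


docs = {
    "a": ({"salary": 0.9, "bonus": 0.4}, frozenset({"hr"})),
    "b": ({"salary": 0.9, "bonus": 0.4}, frozenset({"hr", "all-staff"})),
    "c": ({"salary": 0.9, "bonus": 0.4}, frozenset({"hr"})),
    "d": ({"lunch": 1.0}, frozenset({"all-staff"})),
}
flagged = acl_mismatches(docs)
```

Here "a" and "c" are similar but share an ACL, so only the pairs involving "b" are flagged for review.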
Aspire, a Java-based content processing system for the analysis of unstructured content, supports all of the above techniques.
For further information, or a discussion of your ideas, contact us.