Vocabulary-Based Entity Extraction

Perhaps the most straightforward method of metadata enrichment is entity extraction based on a dictionary of terminology. This is sometimes also referred to as named-entity recognition.

This approach requires an explicit list of terms, which the extraction algorithm uses to identify occurrences in documents. Terms can be single words or phrases. Where occurrences are found, metadata is created to attach to the document.
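
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name extract_entities, the sample vocabulary, and the choice of case-insensitive, whole-word matching are all assumptions made for the example.

```python
import re

def extract_entities(text, vocabulary):
    """Return (term, start, end) tuples for every vocabulary term found in text."""
    if not vocabulary:
        return []
    # Map lowercase forms back to the vocabulary's own spelling.
    canonical = {term.lower(): term for term in vocabulary}
    # Try longer terms first so a phrase wins over a word it contains.
    ordered = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (canonical[m.group(0).lower()], m.start(), m.end())
        for m in pattern.finditer(text)
    ]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["Acme Rocket", "Acme", "Widget Pro"]
document = "The Acme Rocket launch was announced alongside Widget Pro."

hits = extract_entities(document, vocabulary)
# Metadata to attach to the document: the distinct terms that occurred.
metadata = {"entities": sorted({term for term, _, _ in hits})}
print(metadata)
```

Sorting the terms by length before building the pattern means the phrase "Acme Rocket" is reported once, rather than as a separate hit for "Acme". For very large vocabularies a single regular expression may become slow, and a trie-based or Aho-Corasick matcher would be a more typical choice.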

The vocabulary used can be as simple as a small list of companies or products. It can also be very large, for example, a gazetteer of all known place names within a country.


Entity lists sometimes include synonyms and abbreviations, for example, to map "IBM" and "International Business Machines" to the same entity. In other words, they normalize the metadata describing that corporate entity.
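
One way to picture this normalization is a lookup table from each surface form to a single canonical name. The sketch below assumes such a table (SYNONYMS) and a helper called normalize_entities; both are hypothetical names introduced for illustration. Matching is case-sensitive here, on the assumption that abbreviations such as IBM are written in capitals.

```python
import re

# Hypothetical synonym table: surface form -> canonical entity name.
SYNONYMS = {
    "IBM": "IBM",
    "International Business Machines": "IBM",
}

def normalize_entities(text, synonyms=SYNONYMS):
    """Return the sorted canonical names of all entities mentioned in text."""
    # Longer forms first, so full names are preferred over embedded short ones.
    forms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(f) for f in forms) + r")(?!\w)"
    )
    # Every matched surface form is replaced by its canonical name.
    return sorted({synonyms[m.group(0)] for m in pattern.finditer(text)})

print(normalize_entities("International Business Machines, often shortened to IBM."))
```

Both mentions in the sample sentence resolve to the same entity, so the document receives a single metadata value rather than two.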


In some applications, disambiguation is important. In many languages, and especially in English, some words have multiple meanings. The human brain deciphers the intended meaning from the context in which the word is used. Computer algorithms can, to an extent, emulate this process by taking the surrounding words into account.
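
A very simple form of context-based disambiguation counts how many cue words for each sense appear near the ambiguous term. The sketch below assumes a hand-built table of cue words (SENSE_CUES) and a five-word window on either side; the table, the window size, and the function name disambiguate are illustrative assumptions, and real systems typically use statistical or learned models instead.

```python
import re

# Hypothetical cue words for each sense of the ambiguous term "jaguar".
SENSE_CUES = {
    "Jaguar (car maker)": {"car", "engine", "dealer", "drive", "sedan"},
    "jaguar (animal)": {"jungle", "prey", "cat", "habitat", "species"},
}

def disambiguate(text, term, sense_cues, window=5):
    """Return one sense (or None) for each occurrence of term in text."""
    tokens = re.findall(r"\w+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token != term.lower():
            continue
        # Words within `window` positions before and after the occurrence.
        context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        # Score each sense by how many of its cue words appear in the context.
        scores = {sense: len(context & cues) for sense, cues in sense_cues.items()}
        best = max(scores, key=scores.get)
        # No cue words nearby means the occurrence stays unresolved.
        results.append(best if scores[best] > 0 else None)
    return results

print(disambiguate("The jaguar stalked its prey through the jungle.", "jaguar", SENSE_CUES))
print(disambiguate("She took the Jaguar to the dealer for an engine check.", "jaguar", SENSE_CUES))
```

When two senses score equally, this sketch simply returns whichever comes first in the table, so a production system would need an explicit tie-breaking rule.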

A vocabulary-based approach to entity extraction is most commonly applied using lists of industry-specific or even company-specific terms, such as products, departments, events, or processes.

Where entities such as email addresses, telephone numbers and people are concerned, the pattern-based approach is usually preferred.
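
For contrast, a pattern-based extractor describes the shape of an entity rather than listing its values. The sketch below uses deliberately simplified regular expressions: the email pattern does not cover every valid address, and the phone pattern assumes ten-digit, North American style numbers. The names EMAIL, PHONE, and extract_patterns are hypothetical, and people's names are not handled, since they usually need more than a single regular expression.

```python
import re

# Simplified patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")

def extract_patterns(text):
    """Return the email addresses and phone numbers found in text."""
    return {"emails": EMAIL.findall(text), "phones": PHONE.findall(text)}

print(extract_patterns("Contact jane.doe@example.com or call 555-123-4567."))
```

No list of known addresses or numbers is required, which is why this approach suits entities whose possible values cannot be enumerated in advance.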