
Pattern-Based Entity Extraction

Although pattern-based entity extraction can be applied to a wide variety of recognition tasks, it is most commonly used to identify standard entities such as dates, email addresses, telephone numbers, zip codes and the names of people. This form of extraction is sometimes referred to as the recognition of Pattern Identified Entities.

In some cases, pattern matching is supported by a vocabulary.

Each of these entities conforms to set patterns. Dates, for example, can take the forms DD/MM/YYYY, MM/DD/YYYY, MM/DD/YY, or DD MMM YYYY (such as 19 SEP 2013).
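As a rough illustration of the date forms listed above, the following Python sketch combines them into a single regular expression. The names MONTHS, DATE_PATTERN and extract_dates are hypothetical, chosen for this example only, and the pattern checks the shape of a date rather than its validity (it would accept 99/99/2013).

```python
import re

# Month abbreviations accepted by the DD MMM YYYY form.
MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"

# One alternation covering the forms named in the text. DD/MM/YYYY and
# MM/DD/YYYY share the same shape, so a pattern alone cannot tell them apart.
DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{1,2}/\d{1,2}/\d{4}"             # DD/MM/YYYY or MM/DD/YYYY
    r"|\d{1,2}/\d{1,2}/\d{2}"            # MM/DD/YY
    rf"|\d{{1,2}} (?:{MONTHS}) \d{{4}}"  # DD MMM YYYY
    r")\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return every substring of text that has the shape of a date."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


print(extract_dates("Signed on 19 SEP 2013, revised 03/04/2014 and 12/25/13."))
```

Because the first two forms are indistinguishable by shape, deciding whether 03/04/2014 means 3 April or 4 March has to come from context outside the pattern, such as the locale of the source document.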

The technique works by matching a portfolio of such patterns against the contents of documents. In some cases, a vocabulary is then used to filter the results further.
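The vocabulary filtering step can be sketched as follows, using the names of people as the example. This is a minimal sketch under stated assumptions: the FIRST_NAMES vocabulary, the two-capitalized-words candidate pattern, and the function name are all hypothetical, and a production vocabulary would be far larger.

```python
import re

# Hypothetical vocabulary of known first names (assumption for this sketch).
FIRST_NAMES = {"alice", "david", "maria"}

# Candidate pattern: two adjacent capitalized words. On its own this also
# matches place names and other capitalized phrases.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")


def extract_person_names(text):
    """Keep only the pattern matches whose first word is in the vocabulary."""
    names = []
    for match in CANDIDATE.finditer(text):
        first_word = match.group(1)
        if first_word.lower() in FIRST_NAMES:
            names.append(match.group(0))
    return names


print(extract_person_names("Maria Lopez met the New York team and David Chen."))
```

Here the pattern alone proposes "New York" as a candidate, and the vocabulary lookup is what rejects it.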

In addition to the extraction of standard entities, this approach can be used for other items that conform to specific patterns, such as part numbers from a catalog or vehicle registration plates.
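Custom entities of this kind can be handled by keeping one pattern per entity type. The sketch below is illustrative only: both formats (a part number shaped like BX-4471-C and a plate shaped like AB12 CDE) are assumptions made for the example, not real catalog or licensing specifications, and the names PATTERNS and extract_entities are hypothetical.

```python
import re

# Both formats are illustrative assumptions, not real specifications.
PATTERNS = {
    # Two letters, four digits, one letter, separated by hyphens.
    "part_number": re.compile(r"\b[A-Z]{2}-\d{4}-[A-Z]\b"),
    # Two letters, two digits, a space, three letters.
    "registration_plate": re.compile(r"\b[A-Z]{2}\d{2} [A-Z]{3}\b"),
}


def extract_entities(text):
    """Return a dict mapping each entity type to the matches found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


print(extract_entities("Replace part BX-4471-C on vehicle AB12 CDE."))
```

Adding a new entity type then means adding one entry to the dictionary rather than changing the extraction logic.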

The technique used for pattern matching is also called "regular expression matching," or regex for short.

The most common use of entity extraction in enterprise search systems is to provide metadata that drives user interface functions such as search navigation and alternative methods of sorting results. Enterprise search systems typically require a combination of standard and customized pattern extraction techniques.
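To make the link between extraction and search navigation concrete, the following sketch attaches extracted email addresses to each document as a metadata field and then counts the values across documents, which is the kind of figure a navigation panel might display. It is a minimal sketch, not any particular search engine's API: the document layout, the "emails" field name, and both function names are assumptions, and the email pattern is deliberately simple.

```python
import re
from collections import Counter

# Deliberately simple email pattern; real address syntax is more permissive.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def add_entity_metadata(doc):
    """Return a copy of doc with an 'emails' metadata field (hypothetical name)."""
    emails = sorted({address.lower() for address in EMAIL.findall(doc["body"])})
    return {**doc, "emails": emails}


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.get(field, [])))
    return counts


docs = [
    {"id": 1, "body": "Contact jo@example.com or Jo@Example.com."},
    {"id": 2, "body": "Write to jo@example.com and sam@example.org today."},
]
enriched = [add_entity_metadata(doc) for doc in docs]
print(facet_counts(enriched, "emails").most_common())
```

Values are lowercased and de-duplicated per document so that each count reflects the number of documents mentioning an address, not the number of mentions.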

Search Technologies' Aspire content processing framework provides a range of tools and approaches for entity extraction. Aspire can be used with any of the leading search engines, including SharePoint, the Google Search Appliance, and Apache Solr.