Natural Language Processing (NLP) Techniques for Extracting Information
"Cruising the Data Ocean" Blog Series - Part 4 of 6
This blog is a part of our Chief Architect's "Cruising the Data Ocean" series. It offers a deep-dive into some essential data mining tools and techniques for harvesting content from the Internet and turning it into significant business insights.
Once you have identified, extracted, and cleansed the content needed for your use case, the next step is to understand that content. In many use cases, the content with the most important information is written in a natural language (such as English, German, Spanish, or Chinese) and is not conveniently tagged. To extract information from this content, you will need to rely on some level of text mining, text extraction, or possibly full-scale natural language processing (NLP) techniques.
Typical full-text extraction for Internet content includes:
- Extracting entities – such as companies, people, dollar amounts, key initiatives, etc.
- Categorizing content – positive or negative (e.g. sentiment analysis), by function, intention or purpose, or by industry or other categories for analytics and trending
- Clustering content – to identify main topics of discourse and/or to discover new topics
- Fact extraction – to fill databases with structured information for analysis, visualization, trending, or alerts
- Relationship extraction – to fill out graph databases to explore real-world relationships
Follow the 7 steps below to extract information using Natural Language Processing (NLP) techniques.
STEP 1: The Basics
The input to natural language processing will be a simple stream of Unicode characters (typically UTF-8). Basic processing will be required to convert this character stream into a sequence of lexical items (words, phrases, and syntactic markers) which can then be used to better understand the content.
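As a rough illustration of that first conversion, the sketch below decodes UTF-8 bytes, splits the text into tokens that remember their character offsets, and groups the tokens into sentences. The `Token` type, the regular expression, and the end-of-sentence rule are assumptions made for this example. A production tokenizer would need to handle abbreviations such as “I.B.M.”, which this naive splitter would break apart.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset where the token begins
    end: int    # character offset just past the token


# Words (allowing internal apostrophes or hyphens) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:['’\-]\w+)*|[^\w\s]")


def tokenize(text: str) -> List[Token]:
    """Split a Unicode string into tokens, keeping character offsets."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Naive boundary detection: close a sentence after '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.text in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# The input arrives as bytes; decode it before any linguistic processing.
raw = "Caf\u00e9 sales rose 12% in May. Analysts weren't surprised!".encode("utf-8")
text = raw.decode("utf-8")
tokens = tokenize(text)
sentences = split_sentences(tokens)

for sentence in sentences:
    print([tok.text for tok in sentence])
```

Keeping the offsets on every token costs little at this stage and makes later steps (entity extraction, provenance) much easier.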
The basics include:
- Structure extraction – identifying fields and blocks of content based on tagging
- Sentence, phrase, and paragraph boundary detection – these markers are important when doing entity extraction and NLP since they serve as useful breaks within which analysis occurs.
- Language identification – detects the human language for the entire document and for each paragraph or sentence. Language detectors are critical for determining which linguistic algorithms and dictionaries to apply to the text.
- Tokenization – to divide up character streams into tokens which can be used for further processing and understanding. Tokens can be words, numbers, identifiers or punctuation (depending on the use case)
- Basis Technology offers a fully featured language identification and text analytics package (called Rosette Base Linguistics) which is often a good first step to any language processing software. It contains language identification, tokenization, sentence detection, lemmatization, decompounding, and noun phrase extraction.
- Search Technologies has many of these tools available, for English and some other languages, as part of our Natural Language Processing toolkit. Our NLP tools include tokenization, acronym normalization, lemmatization (English), sentence and phrase boundaries, entity extraction (all types but not statistical), and statistical phrase extraction. These tools can be used in conjunction with Basis Technology’s solutions.
- Acronym normalization and tagging – acronyms can be specified as “I.B.M.” or “IBM” so these should be tagged and normalized.
- Search Technologies’ token processing has this feature.
- Lemmatization / Stemming – reduces word variations to simpler forms that may help increase the coverage of NLP utilities.
- Lemmatization uses a language dictionary to perform an accurate reduction to root words. Lemmatization is strongly preferred to stemming if available. Search Technologies has lemmatization for English and our partner, Basis Technology, has lemmatization for 60 languages.
- Stemming uses simple pattern matching to strip suffixes from tokens (e.g. remove “s”, remove “ing”, etc.). The open source Lucene analyzers provide stemming for many languages.
- Decompounding – for some languages (typically Germanic and Scandinavian languages, and some languages written in Cyrillic script), compound words will need to be split into smaller parts to allow for accurate NLP.
- For example: “Samstagmorgen” is German for “Saturday morning” (“Samstag” + “Morgen”)
- See Wiktionary German Compound Words for more examples
- Basis Technology's solution has decompounding.
- Entity extraction – identifying and extracting entities (people, places, companies, etc.) is a necessary step to simplify downstream processing. There are several different methods:
- Regex extraction – good for phone numbers, ID numbers (e.g. SSN, driver’s licenses, etc.), e-mail addresses, numbers, URLs, hashtags, credit card numbers, and similar entities.
- Dictionary extraction – uses a dictionary of token sequences and identifies when those sequences occur in the text. This is good for known entities, such as colors, units, sizes, employees, business groups, drug names, products, brands, and so on.
- Complex pattern-based extraction – good for people names (made of known components), business names (made of known components) and context-based extraction scenarios (e.g. extract an item based on its context) which are fairly regular in nature and when high precision is preferred over high recall.
- Statistical extraction – use statistical analysis to do context extraction. This is good for people names, company names, geographic entities which are not previously known and inside of well-structured text (e.g. academic or journalistic text). Statistical extraction tends to be used when high recall is preferred over high precision.
- Phrase extraction – extracts sequences of tokens (phrases) that have a strong meaning which is independent of the words when treated separately. These sequences should be treated as a single unit when doing NLP. For example, “Big Data” has a strong meaning which is independent of the words “big” and “data” when used separately. All companies have these sorts of phrases which are in common usage throughout the organization and are better treated as a unit rather than separately. Techniques to extract phrases include:
- Part of speech tagging – identifies phrases from noun or verb clauses
- Statistical phrase extraction - identifies token sequences which occur more frequently than expected by chance
- Hybrid - uses both techniques together and tends to be the most accurate method.
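To make two of the entity extraction methods listed above more concrete, here is a minimal sketch of regex extraction and dictionary extraction in plain Python. The patterns, the entity type names, and the small dictionary (including “Acme Corp”) are all hypothetical and deliberately simplistic. Real patterns for phone numbers or e-mail addresses need far more care.

```python
import re
from typing import Dict, List, Tuple

Entity = Tuple[str, str, int, int]  # (type, matched text, start offset, end offset)

# Illustrative regex patterns only (assumed for this example).
REGEX_PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "HASHTAG": re.compile(r"#\w+"),
    "US_PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical dictionary of known entities, keyed by lowercase token sequence.
DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("big", "data"): "PHRASE",
    ("acme", "corp"): "COMPANY",
    ("aspirin",): "DRUG",
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def regex_entities(text: str) -> List[Entity]:
    """Find every match of every regex pattern, with character offsets."""
    found = []
    for etype, pattern in REGEX_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((etype, m.group(), m.start(), m.end()))
    return found


def dictionary_entities(text: str) -> List[Entity]:
    """Find known token sequences, preferring the longest match at each position."""
    words = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    found = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            key = tuple(w for w, _, _ in words[i:i + n])
            if key in DICTIONARY:
                start, end = words[i][1], words[i + n - 1][2]
                found.append((DICTIONARY[key], text[start:end], start, end))
                i += n  # skip past the matched sequence
                break
        else:
            i += 1  # no dictionary entry starts here
    return found


def extract(text: str) -> List[Entity]:
    """Combine both extractors and order the results by position in the text."""
    return sorted(regex_entities(text) + dictionary_entities(text), key=lambda e: e[2])


sample = "Email jane.doe@example.com or call 555-123-4567 about Acme Corp and Big Data #nlp"
entities = extract(sample)
for entity in entities:
    print(entity)
```

Both extractors return start and end offsets so that each match can later be traced back to the exact characters that produced it.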
STEP 2: Decide on Macro versus Micro Understanding
Before you begin, you should decide what level of content understanding is required:
Macro Understanding – provides a general understanding of the document as a whole.
- Typically performed with statistical techniques
- It is used for: clustering, categorization, similarity, topic analysis, word clouds, and summarization
Micro Understanding – extracts understanding from individual phrases or sentences.
- Typically performed with NLP techniques
- It is used for: extracting facts, entities (see above), entity relationships, actions, and metadata fields
Note that, while micro understanding generally contributes to macro understanding, the two can be entirely different. For example, a résumé (or curriculum vitae) may identify a person, overall, as a Big Data Scientist [macro understanding] but it can also identify them as being fluent in French [micro understanding].
STEP 3: Decide If What You Want is Possible (Within A Reasonable Cost)
Not all natural language processing (NLP) projects are possible within a reasonable cost and time. After having done numerous NLP projects, we’ve come up with a flowchart to help you decide if your requirements are likely to be manageable with today’s NLP techniques.
STEP 4: Understand the Whole Document (Macro Understanding)
Once you have decided to embark on your NLP project, you may need a holistic understanding of each document as a whole. This is “macro understanding,” and it is useful for:
- Classifying / categorizing / organizing records
- Clustering records
- Extracting topics
- General sentiment analysis
- Record similarity, including finding similarities between different types of records (for example, job descriptions to résumés / CVs)
- Keyword / keyphrase extraction
- Duplicate and near-duplicate detection
- Summarization / key sentence extraction
- Semantic search
In this architecture, content is downloaded from the internet or external sources (by connectors), then written to Kafka Queues and processed by Spark Machine Learning. The results are written to databases or to a search engine to be used by end-user applications.
Note that “Text Processing Libraries” will need to be included in this architecture to handle all of the basic NLP functions described above in “STEP 1: The Basics.” This can include multiple open source projects working together, or one or two vendor packages.
Algorithms in Spark MLlib which are helpful for macro understanding include:
- Vectors – sparse vectors hold a list of weighted unique words or phrases in the document. Weights can be determined using TF/IDF or other term statistics (such as position in document, term statistics from other corpora or data sets) and then normalized
- Word2Vec – computes intelligent vectors for all terms, such that similar terms have similar vectors. It can be used to find synonyms and semantically similar words.
- Dimensionality Reduction – (typically, Singular Value Decomposition – SVD) used to reduce arbitrary N-length vectors into fixed vector lengths that are more amenable to classification.
- DIMSUM – compares all vectors within a set to all other vectors in a set using a smart pruning algorithm. Comparisons are performed with cosine similarity.
- Nearest Neighbor – a classification technique to compare vectors to sample vectors from a training set. The most similar vector (the nearest neighbor) would be used to classify the new record.
- Classification Algorithms – (Decision Tree, Random Forest, Naïve Bayes, Gradient Boosted Trees) can be used to classify or categorize documents based on a training set; may require that dimensions first be reduced using SVD
- Clustering Algorithms – (K-Means [several types], LDA, PIC) identify clusters of related documents and/or extract topics from the content set. This can be used to research the types of records in a content set or identify similar sets of documents. Note that it may be also possible to cluster users based on the types of records they like.
- Logistic Regression – combine multiple document statistics and vector comparisons into a single formula for classifying a document.
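The ideas behind several of the items above (weighted sparse vectors, cosine similarity, and nearest-neighbor classification) can be sketched in a few lines of plain Python. This is not Spark MLlib code. It is a small stand-in, under the assumptions noted in the comments, that only shows how the pieces fit together. The smoothed IDF formula, the toy documents, and the labels are choices made for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]  # term -> weight; absent terms are implicitly 0


def tfidf_vectors(docs: List[List[str]]) -> List[SparseVector]:
    """Build L2-normalized TF-IDF sparse vectors for tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed IDF (an assumption; other term statistics could be used).
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a: SparseVector, b: SparseVector) -> float:
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def nearest_neighbor(query: SparseVector, labeled: List[Tuple[str, SparseVector]]) -> str:
    """Return the label of the training vector most similar to the query."""
    return max(labeled, key=lambda item: cosine(query, item[1]))[0]


# Toy training set: (label, tokenized document).
corpus = [
    ("resume", "python spark engineer with big data experience".split()),
    ("resume", "accountant fluent in french with audit experience".split()),
    ("job posting", "hiring engineer for spark and python pipelines".split()),
    ("news", "city council approves new budget for schools".split()),
]
query = "budget vote delayed by city council".split()

# For simplicity the query is vectorized together with the corpus so that
# all vectors share the same IDF statistics.
vectors = tfidf_vectors([tokens for _, tokens in corpus] + [query])
labeled = list(zip((label for label, _ in corpus), vectors[:-1]))
predicted = nearest_neighbor(vectors[-1], labeled)
print(predicted)
```

At scale, comparing every vector with every other vector is the expensive part, which is the problem that pruning approaches such as DIMSUM are meant to address.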
STEP 5: Extracting Facts, Entities, and Relationships (Micro Understanding)
Micro understanding is the extraction of individual entities, facts, or relationships from the text. This is useful for (from easiest to hardest):
- Extracting acronyms and their definitions
- Extracting citation references to other documents
- Extracting key entities (people, company, product, dollar amounts, locations, dates). Note that extracting “key” entities is not the same as extracting “all” entities (there is some discrimination implied in selecting what entity is ‘key’)
- Extracting facts and metadata from full text when it’s not separately tagged in the web page
- Extracting entities with sentiment (e.g. positive sentiment towards a product or company)
- Identifying relationships such as business relationships, target / action / perpetrator, etc.
- Identifying compliance violations, statements which show possible violation of rules
- Extracting statements with attribution, for example, quotes from people (who said what)
- Extracting rules or requirements, such as contract terms, regulation requirements, etc.
Micro understanding must be done with syntactic analysis of the text. This means that order and word usage are important.
There are three approaches to performing extraction that provide micro understanding:
1. Top Down – determine Part of Speech, then understand and diagram the sentence into clauses, nouns, verbs, object and subject, modifying adjectives and adverbs, etc., then traverse this structure to identify structures of interest
- Advantages – can handle complex, never-seen-before structures and patterns
- Disadvantages – hard to construct rules, brittle, often fails with variant input, may still require substantial pattern matching even after parsing.
Sample top-down output from Google Cloud Natural Language API
In the deep understanding graph, notice how all of the modifiers are linked together. Also notice that a second step (which requires custom programming) is required to take this graph and identify object / action relationships suitable for exporting to a graph or relational database.
2. Bottoms Up – create lots of patterns, match the patterns to the text and extract the necessary facts. Patterns may be manually entered or may be computed using text mining.
- Advantages – Easy to create patterns, can be done by business users, does not require programming, easy to debug and fix, runs fast, matches directly to desired outputs
- Disadvantages – Requires on-going pattern maintenance, cannot match on newly invented constructs
3. Statistical – similar to bottoms-up, but matches patterns against a statistically weighted database of patterns generated from tagged training data.
- Advantages – patterns are created automatically, built-in statistical trade-offs
- Disadvantages – requires generating extensive training data (thousands of examples), will need to be periodically retrained for best accuracy, cannot match on newly invented constructs, harder to debug
The following are sample patterns used by the bottoms-up or statistical approaches
Note that these patterns may be entered manually, or they may be derived statistically (and weighted statistically) using training data or inferred using text mining and machine learning.
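As a rough illustration of the bottoms-up approach, the sketch below matches a handful of hand-written patterns against text and emits one structured fact per match. The pattern syntax (Python regular expressions with named groups), the fact types, and the example sentences are assumptions made for this example. They are not the pattern format of any framework named in this post.

```python
import re
from typing import Dict, List, Tuple

# A capitalized name of one or more words, e.g. "Linda Nelson".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical hand-written patterns: (pattern ID, fact type, compiled regex).
PATTERNS: List[Tuple[str, str, re.Pattern]] = [
    ("P1", "AGE", re.compile(rf"(?P<person>{NAME}) is (?P<age>\d{{1,3}}) years old")),
    ("P2", "FOUNDED", re.compile(rf"(?P<place>{NAME}),? was founded in (?P<year>\d{{4}})")),
    ("P3", "ACQUISITION", re.compile(r"(?P<buyer>[A-Z]\w+) (?:acquired|bought) (?P<target>[A-Z]\w+)")),
]


def extract_facts(text: str) -> List[Dict[str, object]]:
    """Match every pattern against the text and return one record per match."""
    facts = []
    for pattern_id, fact_type, pattern in PATTERNS:
        for m in pattern.finditer(text):
            facts.append({
                "type": fact_type,
                "pattern_id": pattern_id,  # which pattern produced the fact
                "span": m.span(),          # where in the text it matched
                **m.groupdict(),           # the extracted fields
            })
    return facts


text = (
    "Linda Nelson is 49 years old. "
    "Annapolis was founded in 1649. "
    "Acme acquired Globex."
)
facts = extract_facts(text)
for fact in facts:
    print(fact)
```

Each pattern maps directly onto a desired output record, which is why this style is easy to debug. The trade-off is that a sentence phrased in a way no pattern anticipates simply produces nothing.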
Development frameworks for NLP:
- Open NLP – has many components; is complex to work with; parsing is done with the “top down” approach
- UIMA – has many components and statistical annotation; tends to require a lot of programming; lends itself to a bottoms-up / statistical approach, but not easily implemented
- GATE – configurable bottoms-up approach; is much easier to work with, but configurations must still be created by programmers (not business users)
- Search Technologies’ Natural Language Processing framework – bottoms-up approach scaled to very large sets of patterns. Patterns can be created by business users. Our framework is expected to include statistical patterns from training sets. This is in development.
Service frameworks for NLP:
- IBM Cognitive – statistical approach based on training data
- Google Cloud Natural Language API – top-down full-sentence diagramming system
- Amazon Lex – geared more towards human-interactive (human in the loop) conversations
Some tricky things to watch out for:
- Co-reference resolution - sentences often refer to previous objects. This can include the references below. In all of these cases, the desired data refers to a previous, more explicitly defined entity. To achieve the highest possible coverage, your software will need to identify these back references and resolve them.
- Pronoun reference: “She is 49 years old.”
- Partial reference: “Linda Nelson is a top accountant working in Hawaii. Linda is 49 years old.”
- Implied container reference: “The state of Maryland is a place of history. The capital, Annapolis, was founded in 1649.”
- Handling lists and repeated items
- For example: “The largest cities in Maryland are Baltimore, Columbia, Germantown, Silver Spring, and Waldorf.”
- Such lists often break NLP algorithms and may require special handling which exists outside the standard structures.
- Handling embedded structures such as tables, markup, bulleted lists, headings, etc.
- Note that structure elements can also play havoc with NLP technologies.
- Make sure that NLP does not match sentences and patterns across structural boundaries, for example, from one bullet point into the next.
- Make sure that markup does not break NLP analysis where it shouldn’t. For example, embedded emphasis should not cause undue problems.
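Of the co-reference cases listed above, the partial reference is the simplest to approximate. The sketch below assumes that an earlier entity extraction step has already supplied a list of known full names (`known_people` is a hypothetical input), and it resolves a later first-name-only mention to the most recently seen full name. Pronouns and implied container references need considerably more machinery and are not handled here.

```python
import re
from typing import Dict, List, Tuple


def resolve_partial_references(
    sentences: List[str], known_people: List[str]
) -> List[Tuple[int, int, str]]:
    """Return (sentence index, character offset, full name) for each
    first-name-only mention that can be tied to an earlier full name."""
    last_seen: Dict[str, str] = {}  # first name -> most recently seen full name
    resolved = []
    for index, sentence in enumerate(sentences):
        # Register any explicit full-name mentions in this sentence.
        for full_name in known_people:
            if full_name in sentence:
                last_seen[full_name.split()[0]] = full_name
        # Look for the first name used on its own (not followed by the surname).
        for first, full_name in last_seen.items():
            rest = full_name[len(first):]  # e.g. " Nelson"
            pattern = re.compile(
                r"\b" + re.escape(first) + r"\b(?!" + re.escape(rest) + r")"
            )
            for m in pattern.finditer(sentence):
                resolved.append((index, m.start(), full_name))
    return resolved


sentences = [
    "Linda Nelson is a top accountant working in Hawaii.",
    "Linda is 49 years old.",
]
mentions = resolve_partial_references(sentences, ["Linda Nelson"])
print(mentions)
```

With the back reference resolved, a fact extracted from the second sentence (the age) can be attached to the full entity rather than to an ambiguous first name.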
STEP 6: Maintain Provenance / Traceability
At some point, someone will point to a piece of data produced by your system and say: “That looks wrong. Where did it come from?”
Acquiring content from the Internet and then extracting information from that content will likely involve many steps and a large number of computational stages. It is important to provide traceability (provenance) for all outputs so that you can carefully trace back through the system to identify exactly how that information came to be.
This usually involves:
- Save the original web pages which provided the content
- Save the start and end character positions of all blocks of text extracted from the web page
- Save the start and end character positions for all entities, plus the entity ID and entity type ID matched
- Save the start and end character positions for all patterns matched, plus the pattern ID and sub-pattern IDs (for nested or recursive patterns)
- Record any other cleansing or normalization functions applied to the content
By saving this information throughout the process, you can trace back from the outputs all the way back to the original web page or file which provided the content that was processed. This will allow you to answer the question “Where did this come from?” with perfect accuracy, and will also make it possible to do quality analysis at every step.
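One possible shape for such provenance records is sketched below. The class names, field names, and the example URL are hypothetical. The sketch also assumes that entity offsets are recorded relative to the extracted block before any normalization that changes the text length, so that the offsets can simply be added together when tracing back.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    start: int  # first character position (inclusive)
    end: int    # last character position (exclusive)


@dataclass
class ExtractedEntity:
    entity_id: str
    entity_type_id: str
    span: Span            # offsets relative to the start of the text block
    pattern_id: str = ""  # pattern that matched, if any


@dataclass
class TextBlock:
    source_url: str       # where the saved original page came from
    span: Span            # offsets of the block within the saved page
    normalizations: List[str] = field(default_factory=list)
    entities: List[ExtractedEntity] = field(default_factory=list)


def trace(page: str, block: TextBlock, entity: ExtractedEntity) -> Dict[str, object]:
    """Walk from an extracted entity back to the exact characters of the page."""
    abs_start = block.span.start + entity.span.start
    abs_end = block.span.start + entity.span.end
    return {
        "source_url": block.source_url,
        "page_span": (abs_start, abs_end),
        "original_text": page[abs_start:abs_end],
        "entity_id": entity.entity_id,
        "pattern_id": entity.pattern_id,
        "normalizations": list(block.normalizations),
    }


# The saved original page and one block of text extracted from it.
page = "<html><body><p>Linda Nelson is 49 years old.</p></body></html>"
block = TextBlock(
    source_url="https://example.com/people/linda",  # hypothetical URL
    span=Span(page.index("Linda"), page.index("</p>")),
    normalizations=["strip_html"],
)
entity = ExtractedEntity("person-001", "PERSON", Span(0, len("Linda Nelson")), "P1")
block.entities.append(entity)

record = trace(page, block, entity)
print(record)
```

Because every record carries its own offsets and IDs, the answer to “Where did this come from?” becomes a lookup rather than an investigation.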
STEP 7: Human-Aided Processes
Note that content understanding can never be done without some human intervention somewhere:
- For creating or cleansing or choosing lists of known entities
- For evaluating output accuracy
- To discover new patterns
- To evaluate and correct outputs
- To create training data
Many of these processes can be mind-numbingly repetitive. In a large-scale system, you will need to consider the human element and build that into your NLP system architecture.
Some options include:
- Creating user interfaces to simplify and guide the human evaluation process, for example, allowing users to easily tag entities in content using a WYSIWYG tool and providing easily editable lists to review (with sortable statistics and easy character searching)
- Leveraging crowd-sourcing to scale out human-aided processes, for example, using CrowdFlower
- Finding ways to incorporate human review / human-in-the-loop as part of the standard business process, for example, pre-filling out a form using extracted understanding and having the employee review it before clicking “save” and uploading new content
Once you have extracted information using NLP techniques, how do you use the results for your business needs? I'll discuss this step in my next post.
If you are working on your NLP project and want to learn more about how we can help you leverage these tools and techniques, contact us for further discussions.