Deciding If A Natural Language Processing (NLP) Project Is Feasible
In our technical deep-dive blog, we discuss some essential Natural Language Processing (NLP) tools and techniques for improving query understanding. But not all NLP projects are feasible within reasonable cost and time. Having completed numerous NLP projects, Search Technologies has developed a flowchart to help decide whether your requirements are likely to be manageable with today’s NLP tools and techniques.
Follow the flowchart and detailed instructions below:
Is > 80% accuracy required?
- Accuracy in this context is the percentage of records where the correct answer is produced.
- High-accuracy systems generally require much more work to handle many more text variations, and the effort grows steeply, especially above 80% accuracy.
- Lower-accuracy systems are often still useful for large-scale analytics and trend analysis.
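As a sketch of the definition above, accuracy here is simply the fraction of records for which the system produced the correct answer. The record structure and field names below are invented for illustration:

```python
def accuracy(records):
    """Fraction of records where the system produced the correct answer."""
    correct = sum(1 for r in records if r["predicted"] == r["expected"])
    return correct / len(records)

# Illustrative results: the system got 3 of 5 records right.
results = [
    {"predicted": "Acme Corp", "expected": "Acme Corp"},
    {"predicted": "Acme",      "expected": "Acme Corp"},
    {"predicted": "Globex",    "expected": "Globex"},
    {"predicted": "Initech",   "expected": "Initech"},
    {"predicted": "Umbrella",  "expected": "Hooli"},
]
print(accuracy(results))  # 3 of 5 correct -> 0.6
```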
Is > 80% coverage required?
- Coverage in this context is the percentage of applicable records for which an answer is provided.
- An “applicable record” is a record that contains text providing the desired understanding.
- High-coverage systems generally require much more work to handle many more text variations, and the effort grows steeply above 80% coverage.
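Coverage, as defined above, is computed only over applicable records. A minimal sketch, with the record fields invented for illustration:

```python
def coverage(records):
    """Fraction of applicable records for which the system provided any answer."""
    applicable = [r for r in records if r["applicable"]]
    answered = [r for r in applicable if r["answer"] is not None]
    return len(answered) / len(applicable)

records = [
    {"applicable": True,  "answer": "positive"},
    {"applicable": True,  "answer": None},       # applicable, but no answer produced
    {"applicable": True,  "answer": "negative"},
    {"applicable": False, "answer": None},       # not applicable; excluded from the denominator
]
print(coverage(records))  # 2 of 3 applicable records answered
```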
Can you afford substantial time and effort?
- Of course, “substantial” is relative, but it generally means many months’ worth of work.
- Note: Search Technologies can evaluate your content and requirements and provide more precise estimates.
Macro or Micro understanding?
- See a description of the difference between micro and macro understanding here.
Is training data available?
- Training data is typically required to train statistical models for many types of understanding.
- Training data may already be available if:
- The system is replacing a process which was previously done manually
- The system is filling gaps for manually entered values, for example, by end users filling out forms
- Public training data is available
- Log data from end-user interactions can sometimes be used to infer training data
- Appropriate third party metadata is available for a portion of the content
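As a sketch of the "filling gaps for manually entered values" case above: wherever a human has already supplied the value you want the system to predict, those records can be repurposed as labeled training pairs. The record fields here are illustrative, not a real schema:

```python
# Records where a human already entered the target value become training data.
raw_records = [
    {"description": "Engine warning light came on during taxi", "category": "mechanical"},
    {"description": "Passenger reported a lost bag at gate B4", "category": "baggage"},
    {"description": "Hydraulic pressure dropped below minimum", "category": "mechanical"},
    {"description": "Crew scheduling conflict on flight 212",   "category": None},  # gap: no label
]

# Keep only records with a manually entered label; these are "free" training pairs.
training_data = [
    (r["description"], r["category"])
    for r in raw_records
    if r["category"] is not None
]
print(len(training_data))  # 3 labeled examples available without new annotation work
```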
Can you afford to manually create training data?
- If training data is not available and cannot be inferred, then it will need to be created manually for most types of macro understanding.
- Depending on the scope of the project, this can be done by just a few people, or it could require a larger team, perhaps even using a crowd-sourcing model.
Is the text short or long?
- Generally, short text contains fewer variations and less complex sentence structure and is therefore easier to process for micro understanding.
Is the text fairly regular / narrow domain?
- This question often has more to do with the authors of the text than the text itself.
- If the authors are similar across the board, then they will typically produce fairly regular text that spans a fairly narrow domain.
- Examples include employees, airline pilots, Java programmers, maintenance engineers, librarians, users trained on a certain product, contract lawyers, etc.
- On the other hand, if the authors cover a wide range of backgrounds, education levels, and language skills, then typically they will produce a wide range of text variation across a wide domain. This will be difficult to process.
Is the text academic, journalistic, or narrative?
- Text written by professional writers tends to be longer, more varied, and more complex in sentence structure, all of which makes it harder for machines to understand.
Is there a human in the loop?
- In some applications, there will be human review of the results. Such applications will generally be more tolerant of errors produced by the understanding algorithms.
- For example, a system may extract statements that indicate compliance violations. These would, of necessity, be checked by a compliance officer to determine whether a rule was actually violated.
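One common way to structure human review is to route extractions below a confidence threshold to a reviewer while auto-accepting the rest. This is a hedged sketch of the pattern, not anything from the flowchart itself; the threshold and record fields are invented:

```python
# Sketch: routing low-confidence extractions to human review.
# The 0.9 threshold is arbitrary and would be tuned per application.
REVIEW_THRESHOLD = 0.9

extractions = [
    {"statement": "Vendor payment exceeded approved limit",    "confidence": 0.95},
    {"statement": "Possible undisclosed conflict of interest", "confidence": 0.62},
]

auto_accepted = [e for e in extractions if e["confidence"] >= REVIEW_THRESHOLD]
needs_review  = [e for e in extractions if e["confidence"] < REVIEW_THRESHOLD]
print(len(auto_accepted), len(needs_review))  # 1 1
```

In a strict compliance setting, as described above, even the high-confidence extractions would go to a reviewer; the threshold then serves only to prioritize the queue.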
Is entity extraction the only requirement?
- Applications that only extract entities are much easier to create than those that extract more sophisticated understanding, such as facts, sentiment, relationships, or multiple coordinated metadata values.
Do you have known entities?
- Known entities come from entity lists gathered ahead of time. These can be things like employees (from an employee directory), office locations, public companies, countries, etc.
- Unknown entities are not previously known to the system. These can include people names, company names, locations, etc. Unknown entities can only be determined by looking at names in context.
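Known-entity extraction can be as simple as a gazetteer lookup against the previously gathered lists. A minimal sketch, with the entity names and type labels invented for illustration:

```python
# A gazetteer: previously gathered lists mapping known names to entity types.
known_entities = {
    "acme corp":  "COMPANY",
    "london":     "LOCATION",
    "jane smith": "EMPLOYEE",
}

def extract_known(text):
    """Return (entity, type) pairs for known entities found in the text."""
    lowered = text.lower()
    return [(name, etype) for name, etype in known_entities.items() if name in lowered]

matches = extract_known("Jane Smith visited the Acme Corp office in London.")
print(matches)
```

Real systems refine this with tokenization and word-boundary matching, but the core idea (lookup against a prebuilt list) is why known entities are so much easier than unknown ones, which require context.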
Is entity-tagged text available?
- Unknown entities will need to be determined based on context (e.g. the words around the entity). This can be done statistically if sufficient tagged examples exist.
- Sometimes (rarely) entity-tagged text can come from public sources. Other times it may come from tagging embedded in the content (e.g., HTML markup) or from observing user interactions such as cut-and-paste.
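To make "entity-tagged text" concrete, here is what a single tagged training example might look like in a character-offset format. The structure is illustrative; real corpora use formats such as CoNLL column files or offset-based annotations:

```python
# One entity-tagged training example: text plus (start, end, label) spans.
example = {
    "text": "Maria Lopez joined Initech in Austin.",
    "entities": [
        (0, 11, "PERSON"),     # "Maria Lopez"
        (19, 26, "COMPANY"),   # "Initech"
        (30, 36, "LOCATION"),  # "Austin"
    ],
}

# Verify that each offset span matches the surface text it claims to tag.
for start, end, label in example["entities"]:
    print(label, "->", example["text"][start:end])
```

A statistical tagger trained on enough examples like this learns to recognize similar names from the surrounding context, which is exactly what unknown-entity extraction requires.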
Can you afford to manually create entity-tagged examples?
- If entity-tagged text is not available, it may need to be manually created. This can be an expensive process: usually 500+ examples are required to achieve reasonable accuracy, and as many as 2,000 to 5,000 examples may be needed to exceed 80%.
The flowchart above is a general guideline you can use to evaluate your NLP project's feasibility. For more information about this flowchart or how our consultants can help assess and implement your NLP requirements, contact us.