Question:
Understanding the different methods one could apply to term extraction will give you insight into your own work. There are lots of questions to ask. Do I want just single words, also known as a bag of words? Or do I wish to extract phrases and words? Or only phrases of a certain size, e.g., only three-word phrases? Where do I draw the line between term extraction and entity extraction?
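To make the difference between these choices concrete, here is a minimal sketch using only the Python standard library. The sample sentence and the helper names `tokenize` and `ngrams` are made up for illustration; they are not part of the assignment or of any particular tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample text, used only to illustrate the three options.
sample = "Black carbon contributes to climate change. Black carbon is soot."
tokens = tokenize(sample)

bag_of_words = Counter(tokens)         # single words only
bigrams = Counter(ngrams(tokens, 2))   # two-word phrases
trigrams = Counter(ngrams(tokens, 3))  # three-word phrases only

print(bag_of_words.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Note that this naive version ignores sentence boundaries, so it also produces phrases such as "change black" that span two sentences. Deciding whether to keep or filter such phrases is one of the choices you are being asked to think about.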
To gain this insight, we will compare different ways to tokenize the text and perform term extraction using online term extractors and Python tools. First, you will perform term extraction manually to build your ground truth; then we will experiment with both online tools and Python tools.
This is a great article on comparing the speed of different NLP tokenizers.
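Tokenizers differ in what they produce as well as in how fast they run. The sketch below compares three simple tokenization strategies from the Python standard library on a made-up sentence. It is only an illustration of how the choice of tokenizer changes the tokens you get, not a stand-in for the tokenizers discussed in the article.

```python
import re

# Made-up sentence with punctuation, a possessive, a contraction, and hyphens.
text = "Dr. Smith's co-authored study (2020) isn't peer-reviewed."

# 1. Split on whitespace: punctuation stays attached to the words.
whitespace_tokens = text.split()

# 2. Runs of word characters: apostrophes and hyphens break words apart.
word_tokens = re.findall(r"\w+", text)

# 3. Word characters joined by internal hyphens or apostrophes.
compound_tokens = re.findall(r"\w+(?:[-']\w+)*", text)

for name, toks in [("whitespace", whitespace_tokens),
                   ("word", word_tokens),
                   ("compound", compound_tokens)]:
    print(f"{name:10} {len(toks):2} tokens: {toks}")
```

The same sentence yields a different number of tokens under each strategy, which in turn changes every term count you compute downstream.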
Compare Manual, Programmatic, and Online Term Extraction:
Step 1: Manual Term Extraction
Step 2: Automated Term Extraction
Step 3: Programmatic Term Extraction
Step 4: Identify Terms that Represent the Same "Person, Place, Thing, Organization"
Where to Find the Online Term Extraction Engines
Please review the following URL for a list of term extractors and pick one from the list, or use MonkeyLearn.
Previous classes have had success with FiveFilter.
Note the differences between MonkeyLearn and the other term extractors: MonkeyLearn is geared toward keyword extraction.
Once you have used different term extractors, we will try entity extractors.
Dandelion
wordcount.com
What to Do
Step 1: Manual Term Extraction
Using EACH of the documents that you selected previously for both datasets, manually perform term extraction and entity extraction. What is an entity? What is a concept? If you are asking yourself these questions, think Person, Place, Organization, or Thing. Typically, these are what define your entities. Named entities are the proper-noun forms of these entities (more on this soon).
In terms of concepts, a lot of my research relates to this; in fact, I spent a good portion of my dissertation studying this topic. You can read about this work here on UMBC Equity. I describe domain concepts as the concepts that best describe or represent a domain or a discipline. For example, in the domain of "Climate Change", the concept "Black Carbon" is significant. Identifying "Black" and "Carbon" separately is not sufficient for our goal of extracting knowledge from text, so "Black Carbon" was identified as a concept that is important to the "Climate Change" domain. Term extraction is therefore guided by these concepts. In natural language text, lots of entities can be found, but not all of them are relevant to the domain. This is where domain concepts become important. However, you may also include terms that are neither domain concepts nor proper nouns.
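The "Black Carbon" point can be sketched in code: if a multi-word domain concept is known, the extractor should keep it together instead of splitting it into "black" and "carbon". The following is a minimal sketch, assuming a small hand-written concept list (`DOMAIN_CONCEPTS` is hypothetical and would come from your own ground truth) and a simple greedy longest-match strategy. It is one possible approach, not the method used in the research mentioned above.

```python
import re

# Hypothetical concept list; in practice this would come from your ground truth.
DOMAIN_CONCEPTS = {"black carbon", "climate change", "sea level rise"}
MAX_WORDS = max(len(concept.split()) for concept in DOMAIN_CONCEPTS)

def extract_terms(text):
    """Greedy longest match: prefer a known multi-word concept over single words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink to a single word.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in DOMAIN_CONCEPTS:
                terms.append(candidate)
                i += n
                break
    return terms

print(extract_terms("Black carbon accelerates climate change"))
```

With this concept list, "black carbon" and "climate change" are each returned as one term, while "accelerates" falls through as a single word.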
In this exercise, you are creating your ground truth for the work in upcoming weeks. You are defining the important terms and entities for this document collection. You are also defining which of the entities realized in the text represent the same concept, e.g., "President Obama" and "Barack Obama".
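One simple way to record that two surface forms refer to the same entity is an alias table that maps each mention to a canonical name. The sketch below assumes a hand-built table (`ALIASES` and `canonical` are hypothetical names introduced here); you would fill it from your own manual review.

```python
from collections import Counter

# Hypothetical alias table built by hand during manual extraction.
ALIASES = {
    "president obama": "Barack Obama",
    "barack obama": "Barack Obama",
}

def canonical(mention):
    """Map a mention to its canonical entity name; unknown mentions pass through."""
    key = " ".join(mention.lower().split())  # normalize case and spacing
    return ALIASES.get(key, mention.strip())

# Made-up mentions, as they might appear across your documents.
mentions = ["President Obama", "Barack Obama", "  barack   obama ", "Michelle Obama"]
counts = Counter(canonical(m) for m in mentions)
print(counts)
```

Counting after normalization means the three spellings of the same person add up to a single entry, while "Michelle Obama" stays a separate entity.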
To support this task, create a table for each dataset with the following column structure:
Column 1: The term or noun phrase which you extracted manually
Column 2: The number of times the term occurred in your Document 1
Column 3: The number of times the term occurred in your Document 2
Column 4: The number of times the term occurred in your Document 3
Column 5: The number of times the term occurred in your Document 4
Column 6: The number of times the term occurred in your Document 5
Column 7: The number of times the term occurred in your Document 6
Column 8: The number of times the term occurred in your Document 7
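If you want to double-check your manual counts, a table with this structure can be produced with a few lines of Python. This is a minimal sketch: the two document strings and the term list are made up, and `count_term` is a hypothetical helper that counts case-insensitive whole-phrase matches.

```python
import re

def count_term(term, text):
    """Count case-insensitive, whole-word occurrences of a term or phrase."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up stand-ins for your selected documents.
documents = {
    "Document 1": "Black carbon is a driver of climate change. Black carbon is soot.",
    "Document 2": "Climate change policy rarely mentions carbon.",
}
# Terms taken from your manual extraction (hypothetical here).
terms = ["black carbon", "climate change"]

# One row per term, one count per document.
table = {term: [count_term(term, doc) for doc in documents.values()]
         for term in terms}

# Print as a tab-separated table: the term first, then one column per document.
print("\t".join(["Term"] + list(documents)))
for term, counts in table.items():
    print("\t".join([term] + [str(c) for c in counts]))
```

The tab-separated output can be pasted directly into a spreadsheet, where you can later add the columns for the automated extractors.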
At this stage, you are creating a ground truth: what you believe, to the best of your judgment, to be true about your documents.
Now, build on your Step 1 ground truth by comparing your results with those of two easily accessible online term extraction engines:
Step 2: Run Two Different Automated Term Extraction Engines.
Important Note: You don't have to complete this exercise in full for every single term or term phrase in each of your documents. You can decide to focus on the top twenty or so terms from your ground truth. However, you will gain greater insight if you can push through and do as many terms as possible; this will give you a greater understanding of how different term extraction engines operate. Definitely don't miss the entities and domain concepts.
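Once you have an extractor's output, one common way to summarize how it compares with your ground truth is to list the matched, missed, and extra terms and compute precision and recall. The assignment does not prescribe these measures; the sketch below is just one option, with made-up term lists and hypothetical helper names `normalize` and `compare`.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(term.lower().split())

def compare(ground_truth, extracted):
    """Compare an extractor's terms against the manually built ground truth."""
    truth = {normalize(t) for t in ground_truth}
    found = {normalize(t) for t in extracted}
    matched = truth & found
    return {
        "matched": sorted(matched),
        "missed": sorted(truth - found),   # in the ground truth, not extracted
        "extra": sorted(found - truth),    # extracted, not in the ground truth
        "precision": len(matched) / len(found) if found else 0.0,
        "recall": len(matched) / len(truth) if truth else 0.0,
    }

# Made-up example lists; replace them with your own top terms.
ground_truth = ["Black Carbon", "Climate Change", "Barack Obama", "soot"]
extractor_output = ["black carbon", "climate", "Barack  Obama", "emissions"]

report = compare(ground_truth, extractor_output)
for key, value in report.items():
    print(f"{key}: {value}")
```

Exact string matching is strict: here "climate" does not match "Climate Change". Whether partial matches should count is a judgment call worth noting in your write-up.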
Add the following columns to your table:
Use a term extractor on the document and store its metrics, i.e., the number of times the term occurred in the document, word counts, etc.
Use a term extractor on the document and store its metrics, i.e., the number of times the term occurred in the document, wor
