
Understanding the different methods one could apply to term extraction will give you insight into your own work. There are lots of questions to ask. Do I want just single words, also known as a bag of words? Or do I wish to extract both phrases and words? Or only phrases over a certain size (e.g., only three-word phrases)? Where do I draw the line between term extraction and entity extraction?
To gain this insight, we will compare different ways to tokenize the text and perform term extraction using online term extractors and Python tools. First, you will perform this manually (building your ground truth); then we will experiment with both online tools and Python tools.
This is a great article comparing the speed of different NLP tokenizers.
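As a quick illustration of why tokenizer choice matters before you count terms, here is a minimal sketch comparing a plain whitespace split, a simple regex tokenizer, and NLTK's word_tokenize on the same sentence. It assumes NLTK is installed and its punkt tokenizer data has been downloaded; the sample sentence is invented for illustration.

```python
import re

from nltk.tokenize import word_tokenize  # requires: pip install nltk, then nltk.download('punkt')

text = "Black carbon emissions rose 4.5% in 2020, didn't they?"

# 1. Naive whitespace split: punctuation stays glued to words ("2020,", "they?").
whitespace_tokens = text.split()

# 2. Regex tokenizer: keeps word characters (and simple contractions), drops punctuation.
regex_tokens = re.findall(r"\w+(?:'\w+)?", text)

# 3. NLTK's word_tokenize: separates punctuation and splits contractions ("did", "n't").
nltk_tokens = word_tokenize(text)

for name, tokens in [("whitespace", whitespace_tokens),
                     ("regex", regex_tokens),
                     ("nltk", nltk_tokens)]:
    print(f"{name:>10}: {tokens}")
```

Each tokenizer produces a different token list for the same sentence, which in turn changes the term counts you will record later, so note which tokenization you (or a tool) used.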
Compare Manual, Programmatic, and Online Term Extraction:
Step 1: Manual Term Extraction
Step 2: Automated Term Extraction
Step 3: Programmatic Term Extraction
Step 4: Identify Terms that Represent the Same "Person, Place, Thing, Organization"
Where to Find the Online Term Extraction Engines
Please review the following URL for a list of term extractors, pick two from the list, and also use MonkeyLearn.
Previous classes have had success with FiveFilter.
Note the differences between MonkeyLearn and the other term extractors: MonkeyLearn is geared toward keyword extraction.
Once you have used the different term extractors, we will try two entity extractors:
Dandelion
wordcount.com
What to Do
Step 1: Manual Term Extraction
Using EACH of the documents that you selected previously (for both datasets), manually perform term extraction and entity extraction. What is an entity? What is a concept? If you are asking yourself these questions, think Person, Place, Organization, or Thing. Typically this is what defines your entities. Named entities are the proper forms of these entities (more soon).
In terms of concepts, a lot of my research relates to this (in fact, I spent a good portion of my dissertation studying this topic). You can read about this work here on UMBC Equity. I describe domain concepts as the concepts that are most descriptive or representative of a domain (or a discipline). For example, in the domain of "Climate Change", the concept "Black Carbon" is significant. Identifying "Black" and "Carbon" separately is not sufficient for our goal of extracting knowledge from text, so "Black Carbon" was identified as a concept that is important to the "Climate Change" domain. Term extraction is therefore guided by these concepts. In NL text, many entities can be found, but not all of them are relevant to the domain. This is where domain concepts become important. However, you may also include terms that are not necessarily domain concepts and not necessarily proper nouns.
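To make the "Black Carbon" point concrete, the sketch below contrasts single-word extraction with noun-phrase and named-entity extraction using spaCy. This is only an illustration, not the required method: it assumes spaCy and the en_core_web_sm model are installed, the example text is made up, and the chunks and entities spaCy returns may differ from what you identify manually.

```python
from collections import Counter

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Black carbon is a major driver of climate change. "
        "President Obama discussed black carbon limits with the EPA.")

doc = nlp(text)

# Single words (a "bag of words"): 'black' and 'carbon' show up separately.
single_words = Counter(tok.text.lower() for tok in doc
                       if tok.is_alpha and not tok.is_stop)

# Noun phrases: candidate multi-word domain concepts such as "black carbon".
noun_phrases = Counter(chunk.text.lower() for chunk in doc.noun_chunks)

# Named entities: Person, Place, Organization, and so on.
entities = [(ent.text, ent.label_) for ent in doc.ents]

print("single words :", single_words.most_common(5))
print("noun phrases :", noun_phrases.most_common(5))
print("entities     :", entities)
```

The bag-of-words view scatters "black" and "carbon" across separate counts, while the noun-phrase view keeps "black carbon" together, which is exactly the distinction you are making when you choose your domain concepts.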
In this exercise, you are creating your ground truth for the upcoming work in later weeks. You are defining which terms and entities are important for this document collection. You are also defining which of the entities realized in the text represent the same concept (e.g., "President Obama" and "Barack Obama").
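One lightweight way to record those "same concept" decisions is a plain alias map from surface forms to a canonical label, which you can reuse when counting terms in later weeks. This is only a sketch; the aliases shown are illustrative, not prescribed.

```python
# Hypothetical alias map: surface forms found in the text -> canonical concept.
ALIASES = {
    "president obama": "Barack Obama",
    "barack obama": "Barack Obama",
    "obama": "Barack Obama",
    "black carbon": "Black Carbon",
}

def normalize(term: str) -> str:
    """Map an extracted term to its canonical concept, if one is defined."""
    return ALIASES.get(term.lower().strip(), term)

print(normalize("President Obama"))  # -> Barack Obama
print(normalize("black carbon"))     # -> Black Carbon
print(normalize("methane"))          # -> methane (no alias recorded)
```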
To support this task, create a table (for each dataset) with the following column structure (a minimal Python sketch of such a table appears after the list):
Column 1: The term or noun phrase that you extracted manually.
Columns 2 through 11: The number of times the term occurred in your Document 1 through Document 10, respectively (one column per document).
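The sketch below shows one way such a table can be built programmatically once you have your ground-truth terms. The documents and terms shown are hypothetical placeholders; substitute your own, and note that the simple case-insensitive phrase matching used here is an assumption, not the required counting method.

```python
import re

# Hypothetical stand-ins for two of your ten documents; replace with your own text.
documents = {
    "Document 1": "Black carbon emissions are rising. Black carbon matters.",
    "Document 2": "President Obama spoke about black carbon and climate change.",
    # ... add Documents 3 through 10 here ...
}

# Terms and noun phrases from your manual ground truth (illustrative only).
ground_truth_terms = ["black carbon", "climate change", "president obama"]

def count_term(term: str, text: str) -> int:
    """Count case-insensitive, whole-phrase occurrences of a term in a text."""
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

# Build the table: one row per term, one count column per document.
header = ["Term"] + list(documents)
rows = [[term] + [count_term(term, text) for text in documents.values()]
        for term in ground_truth_terms]

print("\t".join(header))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```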
At this stage, you are creating a ground truth; that which you believe to be (sort of) true about your documents.
Now, build on your Step 1 ground-truth by comparing your results with those of two easily accessible online term extraction engines:
Step 2: Run Two Different Automated Term Extraction Engines.
Important Note: You don't have to complete this exercise in full for every single term (or term phrase) in each of your documents. You can decide to focus on the top twenty (or so) terms that you obtained via ground truth. However, you will gain greater insight if you can push through and do as many terms as possible; this will give you a greater understanding of how different term engines operate. (Definitely don't miss the entities and domain concepts).
Add the following columns to your table:
Use term extractor 1 for Document 1 and store metrics (e.g., the number of times the term occurred in the document, word counts, etc.).
Use term extractor 2 for Document 1 and store metrics (e.g., the number of times the term occurred in the document, word counts, etc.).
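Once you have the extractor columns filled in, a quick way to see how an automated tool compares with your ground truth is a simple set overlap on the top terms. The sketch below is illustrative only: the term lists are invented, and real extractors will return their own scores and counts.

```python
# Top terms from your manual ground truth and from two hypothetical extractors.
ground_truth = {"black carbon", "climate change", "president obama", "epa"}
extractor_1 = {"black carbon", "carbon", "climate", "epa"}
extractor_2 = {"climate change", "black carbon", "obama", "emissions"}

def overlap_report(name: str, extracted: set, truth: set) -> None:
    """Print which ground-truth terms an extractor found, missed, or added."""
    found = truth & extracted
    missed = truth - extracted
    extra = extracted - truth
    recall = len(found) / len(truth) if truth else 0.0
    print(f"{name}: recall={recall:.2f}")
    print(f"  found : {sorted(found)}")
    print(f"  missed: {sorted(missed)}")
    print(f"  extra : {sorted(extra)}")

overlap_report("extractor 1", extractor_1, ground_truth)
overlap_report("extractor 2", extractor_2, ground_truth)
```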
