Question: Understanding the different methods one could apply to term extraction will give you insight into your own work. There are lots of questions to ask.

Do I want just single words, also known as a bag of words? Or do I wish to extract phrases and words? Or only phrases over a certain size (e.g., only three-word phrases)?

Where do I draw the line between term extraction and entity extraction?
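To make the difference between a bag of words and fixed-size phrases concrete, here is a minimal sketch using only the Python standard library. The sample sentence and the helper names `tokenize` and `ngrams` are made up for illustration; they are not part of any tool named in this assignment.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, chosen only for illustration.
text = "Black carbon emissions affect climate change. Black carbon is a pollutant."
tokens = tokenize(text)

bag_of_words = Counter(tokens)            # single words only
two_word = Counter(ngrams(tokens, 2))     # two-word phrases
three_word = Counter(ngrams(tokens, 3))   # only three-word phrases

print(bag_of_words.most_common(3))
print(two_word.most_common(3))
print(three_word.most_common(3))
```

Note that this simple version lets phrases run across sentence boundaries (it produces "change black", for example), which is one of the decisions you will need to make when you define your own terms.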

To gain this insight, we will compare different ways to tokenize the text and perform term extraction using online term extractors and Python tools. First, you will perform this manually (building your ground truth), then we will experiment with both online tools and Python tools.
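One possible way to compare a tool's output against your manual ground truth is to compute set overlap, precision, and recall. The sketch below is an assumption about how you might do this, not a required method; the function name `compare_to_ground_truth` and both term lists are hypothetical.

```python
def normalize(term):
    """Lowercase a term and collapse repeated whitespace so variants compare equal."""
    return " ".join(term.lower().split())

def compare_to_ground_truth(ground_truth, extracted):
    """Compare extracted terms with manually chosen terms using exact matching."""
    truth = {normalize(t) for t in ground_truth}
    found = {normalize(t) for t in extracted}
    matched = truth & found
    return {
        "matched": sorted(matched),
        "missed": sorted(truth - found),   # in the ground truth, not found by the tool
        "extra": sorted(found - truth),    # found by the tool, not in the ground truth
        "precision": len(matched) / len(found) if found else 0.0,
        "recall": len(matched) / len(truth) if truth else 0.0,
    }

# Hypothetical term lists, for illustration only.
manual_terms = ["Black Carbon", "climate change", "emissions"]
tool_terms = ["black carbon", "Emissions", "pollutant", "carbon"]

report = compare_to_ground_truth(manual_terms, tool_terms)
print(report)
```

Exact matching after lowercasing is the simplest choice; it will count "carbon" as an extra term rather than a partial match for "Black Carbon", which may or may not be what you want.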

This is a great article on comparing the speed of different NLP tokenizers.
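As a small illustration of how tokenizers can differ in both output and speed, the sketch below compares a plain whitespace split with a regular-expression tokenizer, using only the standard library. It is not taken from the article; the function names and the sample sentence are assumptions made for this example, and timings will vary by machine.

```python
import re
import timeit

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()

def word_tokens(text):
    """Keep alphanumeric words, allowing internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

# Hypothetical sample sentence, for illustration only.
sample = "Three-word phrases, like \"black carbon\", aren't single tokens."

print(whitespace_tokens(sample))
print(word_tokens(sample))

# Rough timing of each tokenizer over 1000 runs.
for tokenizer in (whitespace_tokens, word_tokens):
    seconds = timeit.timeit(lambda: tokenizer(sample), number=1000)
    print(tokenizer.__name__, seconds)
```

The whitespace version leaves tokens such as `phrases,` and `tokens.` with punctuation attached, so the same word can show up as several different terms when you count them.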

Compare Manual, Programmatic, and Online Term Extraction:

Step 1: Manual Term Extraction

Step 2: Automated Term Extraction

Step 3: Programmatic Term Extraction

Step 4: Identify Terms that Represent the Same "Person, Place, Thing, Organization"

Where to Find the Online Term Extraction Engines

Please review the following URL for a list of term extractors, pick 2 from the list, and also use MonkeyLearn.

Previous classes have had success with FiveFilter.

Note the differences between MonkeyLearn and the other term extractors: MonkeyLearn is geared toward keyword extraction.

Once you have used different term extractors, we will try 2 entity extractors.

Dandelion

wordcount.com

What to Do

Step 1: Manual Term Extraction

Using EACH of the documents that you selected previously (for both datasets), manually perform term extraction and entity extraction. What is an entity? What is a concept? If you are asking yourself these questions, think Person, Place, Organization, or Thing. Typically, this is what defines your entities. Named entities are the proper forms of these entities (more soon).

In terms of concepts, a lot of my research relates to this (in fact, I spent a good portion of my dissertation studying this topic). You can read about this work here on UMBC Equity. I describe domain concepts as the concepts that best describe or represent a domain (or a discipline).

For example, in the domain of "Climate Change", the concept "Black Carbon" is significant. Identifying "Black" and "Carbon" separately is not sufficient for our goal of extracting knowledge from text, so "Black Carbon" was identified as a concept that is important to the "Climate Change" domain. Therefore, term extraction is guided by these concepts. In natural language (NL) text, lots of entities can be found, but not all of them are relevant to the domain. This is where domain concepts become important. However, you may also include terms that are not necessarily domain concepts and not necessarily proper nouns.
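The "Black Carbon" example can be sketched in code: if you already have a list of domain concepts, you can match the longest concept first so that a multi-word concept is kept as one term instead of being split into single words. This is a minimal sketch under that assumption; the function `extract_terms`, the concept list, and the sample sentence are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_terms(text, domain_concepts):
    """Greedy longest-match: keep known multi-word concepts whole, else emit single words."""
    concepts = {tuple(tokenize(c)) for c in domain_concepts}
    concepts.discard(())  # ignore concepts that contain no word characters
    longest = max((len(c) for c in concepts), default=1)
    tokens = tokenize(text)
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in concepts:
                terms.append(" ".join(candidate))
                i += n
                break
    return terms

# Hypothetical concept list and sentence, for illustration only.
concepts = ["Black Carbon", "Climate Change"]
terms = extract_terms("Black carbon contributes to climate change.", concepts)
print(terms)
```

This sketch still emits every remaining single word ("contributes", "to"), so you would normally filter those afterwards, for instance with a stop-word list or by keeping only the terms in your ground truth.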

In this exercise, you are creating your ground truth for the upcoming work in later weeks. You are defining which terms and entities are important for this document collection. You are also defining which of the entities realized in the text represent the same concept (e.g., "President Obama" and "Barack Obama").
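One simple way to record which mentions represent the same concept is a hand-built alias table that maps each surface form to a single canonical name. The sketch below assumes such a table exists; `ALIASES` and `group_mentions` are hypothetical names, and the mapping itself is something you would build manually as part of your ground truth.

```python
from collections import defaultdict

# Hypothetical, hand-built alias table: lowercase surface form -> canonical name.
ALIASES = {
    "president obama": "Barack Obama",
    "barack obama": "Barack Obama",
}

def group_mentions(mentions, aliases):
    """Group mentions under their canonical name; unknown mentions keep their own text."""
    groups = defaultdict(list)
    for mention in mentions:
        key = " ".join(mention.lower().split())
        canonical = aliases.get(key, mention)
        groups[canonical].append(mention)
    return dict(groups)

# Hypothetical mentions, for illustration only.
mentions = ["President Obama", "Barack Obama", "UMBC"]
grouped = group_mentions(mentions, ALIASES)
print(grouped)
```

A lookup table like this only captures the equivalences you have already decided on; it does not discover new ones, which is why the manual pass matters.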

To support this task, create a table (for each dataset) with the following column structure:

Column 1: The term or noun phrase that you extracted manually, and

Column 2