As you may suspect, the length-4 common-substring heuristic is just a simple, approximate technique for finding words with common roots. It is liable to fail in many situations. One failure case is unrelated words that share a substring. For example, consider the following words:

 Ionization, Ionic, Actualization, Actual 

A string is composed of words, defined as continuous runs of alphanumeric characters separated by separators (spaces, commas, periods, semicolons, exclamation marks, or any other punctuation symbol except apostrophes).

So a string might look like this:

The hungry scanner keeps a suspicious watch on doctors and their unsuspecting patients
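
As a rough sketch of the word definition above, the string can be split with a regular expression. The names `WORD_RE` and `split_words` are hypothetical, and two things here are assumptions the question does not state: apostrophes are kept inside words (they are the one punctuation mark excluded from the separators), and words are lowercased so that later comparisons ignore case.

```python
import re

# Assumption: a word is a run of ASCII letters, digits, and apostrophes.
# Everything else (spaces, commas, periods, ...) acts as a separator.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def split_words(text):
    """Return the words of `text` in order, lowercased for comparison."""
    return [w.lower() for w in WORD_RE.findall(text)]

sentence = ("The hungry scanner keeps a suspicious watch on doctors "
            "and their unsuspecting patients")
words = split_words(sentence)
```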

The scanner counts words, collecting those together where a common substring of length 4 or greater occurs.
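
The grouping rule can be sketched with a standard dynamic-programming longest-common-substring routine. This is only one way to do the check, not necessarily the one expected in Q1; `longest_common_substring` and `share_root` are hypothetical names, and the case-insensitive comparison is an assumption.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: match length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def share_root(a, b, min_len=4):
    """True when a and b share a substring of at least min_len characters."""
    return len(longest_common_substring(a.lower(), b.lower())) >= min_len
```

With this sketch, `share_root("suspicious", "unsuspecting")` is true, while `share_root("hungry", "scanner")` is false.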

For example, in the given sentence, "suspicious" and "unsuspecting" have a common substring of length 4, "susp". Thus the scanner would group these two words together.

Clearly the first two words have a common root, as do the last two. Unfortunately, the simple approach also identifies Ionization and Actualization as having a common root due to the presence of the string "tion". We have a situation where Actualization could be counted in two slots.

Find an approach to "break ties" in these cases. What logic can you apply to declare that [Actual, Actualization] is a better match than [Actualization, Ionization]?

Describe and implement your logic as a separate subroutine called from the function implemented in Q1.
