Problem 3. NLP problem setup

Encyclopedia Britannica's 3rd edition contains approximately ten thousand articles. After it was scanned and converted to text using optical character recognition software, you are given a segment of it in a single text file. The file contains 100,000 text lines / 900,000 words / 300 articles, and has been manually marked up for article start and article finish.

For your reference, an excerpt from the raw text and an excerpt from the marked text are given in the files brit3-excerpt.txt and brit3-excerpt-marked.txt, respectively. Feel free to open these files in your favorite text editor and have a look.

Instructions: For each of the questions in each of the two problems below, give a 1-2 sentence answer. You don't have to write any code for this problem. Please fill out your answers in the cells below (as markdown text). Note that this problem does not have a single best answer. Use your imagination and be creative!

3.1 Imagine that you need to build a system that would split the given text into articles. Describe how you can cast this task as a classification problem:

  1. What are the instances that you will need to classify?
  2. What are the labels for the instances that your classification function will need to assign?
  3. Assuming you use 2/3 of your marked-up data for training, how many instances will you have in your training set?
  4. Give at least 5 examples of boolean features you might wish to include when building such a classifier.
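
As an illustration of one possible framing (not the only valid answer), the sketch below treats each text line as an instance with a binary label: "this line starts an article" or "it does not". The function name `line_features`, the specific features, and the idea that entries open with a capitalized headword followed by a comma are all assumptions made for illustration; they are not taken from the excerpt files.

```python
def line_features(prev_line: str, line: str) -> dict[str, bool]:
    """Boolean features for one line, given the line before it.

    Hypothetical feature set: it assumes (without having checked the
    excerpt) that an article tends to open with an upper-case headword.
    """
    stripped = line.strip()
    first_word = stripped.split()[0] if stripped else ""
    return {
        # The previous line is empty or whitespace only.
        "prev_line_blank": prev_line.strip() == "",
        # The first word, ignoring trailing punctuation, is all upper case.
        "first_word_all_caps": first_word.strip(",.;:").isupper(),
        # The first word is immediately followed by a comma.
        "first_word_ends_with_comma": first_word.endswith(","),
        # The previous line ends a sentence.
        "prev_line_ends_with_period": prev_line.rstrip().endswith("."),
        # The line begins with a lower-case letter (likely a continuation).
        "starts_with_lowercase": stripped[:1].islower(),
    }


# With lines as instances, the training set size follows from the
# line count given in the problem statement.
TOTAL_LINES = 100_000
train_instances = TOTAL_LINES * 2 // 3  # 66666 lines
```

Under a different framing (for example, words or paragraphs as instances) the instance count and the useful features would change accordingly.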

YOUR ANSWER HERE

3.2 Now imagine that you need to build a system that would both split the text into articles and identify article titles. Again, assume that the titles have been marked in your training set. How can you cast this task as a classification problem?

Please specify answers to (1), (2), (3), and (4) above for this new task.
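
One way to picture this second task, again only as a sketch, is to classify individual words rather than lines, so that a title can occupy part of a line. The labels `"TITLE"` and `"OTHER"`, the helper `split_into_articles`, and the assumption that every article begins with its title are hypothetical choices for illustration. The code shows how per-word labels, once predicted, would yield both the article boundaries and the titles.

```python
def split_into_articles(tokens: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Group labeled words into (title, body) pairs.

    Assumes each article begins with its title, so a new article starts
    whenever a "TITLE" word follows a non-"TITLE" word. Words that appear
    before the first title are dropped.
    """
    articles: list[dict[str, list[str]]] = []
    prev_label = "OTHER"
    for token, label in zip(tokens, labels):
        if label == "TITLE" and prev_label != "TITLE":
            articles.append({"title": [], "body": []})
        if articles:
            key = "title" if label == "TITLE" else "body"
            articles[-1][key].append(token)
        prev_label = label
    return [(" ".join(a["title"]), " ".join(a["body"])) for a in articles]


# With words as instances, the training set size follows from the
# word count given in the problem statement.
TOTAL_WORDS = 900_000
train_instances = TOTAL_WORDS * 2 // 3  # 600000 words

tokens = ["ABACUS", "an", "instrument", "ABBEY", "a", "monastery"]
labels = ["TITLE", "OTHER", "OTHER", "TITLE", "OTHER", "OTHER"]
print(split_into_articles(tokens, labels))
```

A line-level variant with three labels (title line, other article-start line, everything else) would be an equally reasonable answer; the word-level version is shown only because it does not depend on titles sitting on their own lines.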
