In this assignment, you will construct a constituency tree and implement the task of POS tagging using constituency parsing and dependency parsing methods.
Constituency parsing is the process of converting natural language input into structured hierarchical representations of grammatical constituents. There are many ways to produce constituency parses, including but not limited to dynamic programming and search methods. A popular approach for statistical constituency parsing is the CKY algorithm, a bottom-up parsing method with a dynamic programming approach.
Dependency parsing is the process of automatically assigning semantic dependency relations to pairs of words—for example, to identify a predicate and its subject, or a nominal and its modifier. Extracting this information can help NLP systems understand meaning and communicative intent. For this assignment, you will explore a high-performing pretrained dependency parser, available as part of the Stanford CoreNLP pipeline and accessible via the NLTK Python library.
2 Instructions
Code
This section must be completed using Python 3.6+. You will also require the following packages and tools:
pandas
numpy
NLTK or SpaCy
Java version 1.8+
If you want to use an external package for any reason, you must get approval from the course staff before submission.
Written
In this section, you will need to write your answers in Microsoft Word or LaTeX and convert them to PDF format.
Q1
Q3b
3 Questions
Q1. For the given sentence "Cat sat on the mat", construct a constituency tree using the following production rules: Written [5]
S → VP
VP → V NP PP
NP → DET ADJ N
PP → P NP
NP → DET N
First, identify the main verb in the sentence and draw a horizontal line to represent the root node of the tree. This will be the starting point for the rest of the tree.
Identify the subject of the verb and draw a vertical line coming off the root node. Label this line with the word "NP" for noun phrase.
Identify any direct objects or indirect objects of the verb and draw additional vertical lines coming off the root node. Label these lines with the word "NP" as well.
For each noun phrase, determine what words make up the phrase. These can include determiners (e.g. "the," "a," "an"), adjectives, and nouns. Draw horizontal lines coming off the noun phrase lines to represent each of these words and label them with the appropriate part of speech (e.g. "DET," "ADJ," "N").
If the sentence has any modifying phrases or clauses, draw additional lines coming off the relevant words in the tree and label them with the appropriate constituent label (e.g. "PP" for prepositional phrase, "SBAR" for subordinating clause).
Continue this process until you have identified and labeled all of the constituents in the sentence.
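If you want to sanity-check your hand-drawn tree, the same hierarchy can be written in bracket notation and rendered with NLTK's nltk.tree.Tree class. The sentence and labels below are a generic illustration only, not the answer to Q1:

```python
from nltk.tree import Tree

# A bracketed constituency tree for an illustrative sentence (not the
# Q1 answer): each pair of parentheses encloses one constituent.
t = Tree.fromstring("(S (NP (DET the) (N dog)) (VP (V barked)))")

t.pretty_print()   # ASCII rendering in the terminal
# t.draw()         # opens a graphical tree window
print(t.leaves())  # the words at the terminal nodes
```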
Q2. Constituency Parsing using CYK Algorithm: Code [30]
In this task, you will implement the CKY parsing algorithm for the POS tagging task on the Air Travel Information Service (ATIS) dataset.
1. Load the ATIS Context-Free Grammar data provided in the NLTK library using
nltk.data.load("grammars/large_grammars/atis.cfg").
Convert this Context-Free Grammar (CFG) to Chomsky Normal Form (CNF) using the
chomsky_normal_form() method [1]. This creates a list of production rules (nltk.grammar.Production).
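As a sketch of this step, the snippet below applies the same chomsky_normal_form() call to a small inline grammar (the Q1 rules plus invented lexical entries), since loading the real ATIS grammar first requires nltk.download('large_grammars'):

```python
import nltk

# Small stand-in grammar (the Q1 rules plus made-up lexical entries);
# for the assignment, load the real one with:
#   nltk.data.load("grammars/large_grammars/atis.cfg")
g = nltk.CFG.fromstring("""
    S -> VP
    VP -> V NP PP
    NP -> DET ADJ N
    NP -> DET N
    PP -> P NP
    V -> 'sat'
    DET -> 'the'
    ADJ -> 'soft'
    N -> 'cat' | 'mat'
    P -> 'on'
""")

cnf = g.chomsky_normal_form()    # returns an equivalent CFG in CNF
for prod in cnf.productions():   # each one is an nltk.grammar.Production
    print(prod)
```

After conversion, the unit rule S -> VP and the ternary rule VP -> V NP PP are gone; every remaining rule has at most two symbols on its right-hand side.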
2. Using the production rules found in step 1 and the CKY algorithm in Fig. 13.5, Section 13.2.2 of the book "Speech and Language Processing" [2], construct the CKY parse table:
function CKY-PARSE(words, grammar) returns table
    for j ← from 1 to LENGTH(words) do
        for all {A | A → words[j] ∈ grammar}
            table[j-1, j] ← table[j-1, j] ∪ A
        for i ← from j-2 down to 0 do
            for k ← i+1 to j-1 do
                for all {A | A → B C ∈ grammar and B ∈ table[i, k] and C ∈ table[k, j]}
                    table[i, j] ← table[i, j] ∪ A
3. Alter the algorithm as follows:
For each entry in the table, maintain a record of the back-pointers for each non-terminal node, i.e., each non-terminal node is paired with pointers to the table entries from which it was derived.
For each entry, permit multiple versions of the same non-terminal node to be entered in the table. For example, if a Start node S1 is derived for a table cell (1, 3), another Start node S2 can be entered, if derived, in the same table cell (1, 3).
4. Search for the Start node in the table and, with the back-pointers generated in step 3, recursively find the nodes from which the current node is derived until you reach all the terminal nodes. Use the nltk.tree.Tree structure to store the parse tree.
If the Start node is not found, show the output "The sentence is not valid".
If there are multiple Start nodes, construct a tree for each Start node.
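Steps 2-4 can be prototyped in plain Python before wiring in the converted ATIS rules. The sketch below uses a tiny hand-written CNF grammar (an assumption for illustration only) and keeps a list of back-pointers per non-terminal in each cell, so that multiple derivations, and therefore multiple parse trees, survive:

```python
from collections import defaultdict

# Tiny hand-written CNF grammar (illustrative only, not ATIS):
# LEXICON maps word -> non-terminals A with A -> word;
# BINARY maps (B, C) -> non-terminals A with A -> B C.
LEXICON = {
    "she": {"NP"}, "eats": {"V"}, "fish": {"N", "NP"},
    "with": {"P"}, "a": {"DET"}, "fork": {"N"},
}
BINARY = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
    ("NP", "PP"): {"NP"},   # makes PP attachment ambiguous on purpose
    ("P", "NP"): {"PP"},
    ("DET", "N"): {"NP"},
}

def cky_parse(words):
    """Fill the CKY table; each cell maps non-terminal -> back-pointer list."""
    n = len(words)
    table = [[defaultdict(list) for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        for a in LEXICON.get(words[j - 1], ()):
            table[j - 1][j][a].append(words[j - 1])        # terminal back-pointer
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for b in table[i][k]:
                    for c in table[k][j]:
                        for a in BINARY.get((b, c), ()):
                            # keep every derivation of `a`, not just the first
                            table[i][j][a].append((b, i, k, c, k, j))
    return table

def build_trees(table, sym, i, j):
    """Follow back-pointers recursively, yielding all trees as nested tuples."""
    for bp in table[i][j][sym]:
        if isinstance(bp, str):                            # reached a terminal
            yield (sym, bp)
        else:
            b, bi, bk, c, ci, cj = bp
            for left in build_trees(table, b, bi, bk):
                for right in build_trees(table, c, ci, cj):
                    yield (sym, left, right)

words = "she eats fish with a fork".split()
table = cky_parse(words)
trees = list(build_trees(table, "S", 0, len(words)))
if not trees:
    print("The sentence is not valid")
for t in trees:
    print(t)   # two trees: one parse per PP attachment
```

For the assignment itself, replace the toy grammar with the CNF productions from step 1 and wrap the recovered tuples in nltk.tree.Tree objects before calling tree.draw().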
5. Print all possible parse trees for the following sentences using tree.draw().
What is the cheapest one way flight from columbus to indianapolis.
Is there a flight from memphis to los angeles.
What aircraft is this.
Show american flights after twelve p.m. from miami to chicago
Q3a. Dependency Parsing - POS tagging: Code [10]
Implement the get_dependency_parse function in dep_parser.py, which should return a list of words and their associated dependency relations in a CoNLL-formatted string [3], given an input sentence (string). The CoNLL-formatted string should include a line for each word, with tab-separated columns indicating the word, its POS tag, the index of its head word, and its relation to the head word.
To retrieve this information you should use the Stanford CoreNLP dependency parser, which is accessible via the NLTK library in Python. You will need to do the following to get the CoreNLP server up and running:
Make sure that you have Java version 1.8+ set up on your system. Although NLTK is a Python library, the backend for the CoreNLP parser is a Java-based server.
Download the Stanford CoreNLP server and model JAR files using the link here: https://nlp.stanford.edu/software/stanford-corenlp-latest.zip. Note that this will be a large (approximately 483 MB) file, so it may take some time to download.
Unzip the file and ensure that you have downloaded version 4.5.0 (this should be indicated in the title of the unzipped folder).
Navigate to the unzipped directory using your terminal and run the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
You can navigate to the directory using cd: cd stanford-corenlp-4.5.0. You can use pwd to print your current directory in order to determine how to navigate to stanford-corenlp-4.5.0.
Once you have completed these steps, you can run the dependency parser within your function using the raw_parse method [4]. Refer to the skeleton code for additional tips. Supplementary material: dep_parser.py
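With the server from the previous steps running, the function body can be sketched as follows. CoreNLPDependencyParser, raw_parse, and DependencyGraph.to_conll(4) (four tab-separated columns: word, tag, head index, relation) are NLTK's CoreNLP interface; the exact signature here is an assumption, so adapt it to the skeleton in dep_parser.py:

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

def get_dependency_parse(sentence, url="http://localhost:9000"):
    """Parse `sentence` with a locally running CoreNLP server and return
    a CoNLL-formatted string: word, POS tag, head index, relation."""
    parser = CoreNLPDependencyParser(url=url)   # port matches the server command above
    parse, = parser.raw_parse(sentence)         # raw_parse yields DependencyGraph objects
    return parse.to_conll(4)                    # style 4 = the four columns above

# Example call (requires the server started in the steps above):
#   print(get_dependency_parse("Flying planes can be dangerous."))
```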
Q3b. Dependency Parsing - POS tagging: Written [5]
Run the following sentences through your get_dependency_parse function:
Flying planes can be dangerous.
Amid the chaos I saw her duck.
For each sentence, provide the CoNLL-formatted output string returned by your function. Indicate which labels are (or might be) incorrect or ambiguous and why the parser might have mislabeled them. Then, create your own ambiguous sentence, and run it through your get_dependency_parse function. Indicate whether (and how, if applicable) the output was affected by the ambiguity.
Links
[1] https://www.nltk.org/api/nltk.grammar.html#module-nltk.grammar
[2] https://web.stanford.edu/~jurafsky/slp3/13.pdf
[3] https://aclanthology.org/W06-2920.pdf
[4] https://www.nltk.org/api/nltk.parse.corenlp.html