

Part Two

Review the airline sentiment data set included with this exam. This data set is saved in an Excel spreadsheet and contains 4 columns (ID, sentiment, airline, and text). Sentiment can be either positive, negative, or neutral.

Answer the following questions. You may need to do some analysis with Python on the data set to answer some of these questions. Some of the sections below are tasks instead of questions. If a section is labelled as a task, perform the task; there will not be a question to answer for it. Use the tools available for these tasks. For example, sometimes it is easier to accomplish something in Excel than to write code for it.

1. (5 points) The data set is somewhat imbalanced. There are fewer positive tweets and many more negative tweets. Explain how this might impact your ability to create analytical models and how you might be able to address the problem.

Task: Create a new data set that consists of the first 2000 tweets in each of the three categories (positive, neutral, negative). The data set should have only two columns, which are the tweet and the sentiment. Use the first 2000 ID numbers in each category so that we are all using the same data set. Remove the tag with the airline name from the text of these tweets so that the tweet does not have an identifier for the airline. This can all be done in Excel quite easily.

Task: Load the data set into a Pandas data frame. Do all necessary preprocessing and vectorization to prepare the data for analysis. Make sure that there are no non-alpha characters in the resulting data. Use a snowball stemmer to stem the output and use a TF-IDF vectorizer that has no more than 500 features. Each token (stemmed word) should be in at least 10 documents, but no more than 60% of documents in the entire corpus should contain any specific token. Create a 3-node cluster with the data set using the tokens, but not the sentiment tag.

2. (5 points) How many tweets are in each of the 3 clusters?

3. (15 points) What are the centroid values for the first token in each of the three clusters? Note that since the clusters are all built from the same data structure, the first centroid position for all three clusters should represent the same token. What does this centroid value represent (in other words, can you relate this value back to the work that we did in the data set creation and vectorization process and tell me what this value is)? Can you briefly describe how this centroid is calculated?

Task: Drop the 2000 neutral tweets from the original data set, so that you are left with 4000 tweets (2000 positive and 2000 negative). Dummy code sentiment with positive as 1 and negative as 0. You will use the dummy coded version as your classification target. Redo the preprocessing and vectorization using the same hyperparameters. Since the corpus has changed, your vectorization result will be a bit different. Create a classification model using a decision tree with no more than three levels in the tree. Split the data as 75% training and 25% testing, making sure that each partition still has 50% of its content as positive and 50% as negative.

4. (10 points) Report the confusion matrix for the model predictions on the training data. What is the overall accuracy of the model? Are there differences between the recall and precision values for positive versus negative?

5. (10 points) Report the confusion matrix for the model predictions on the testing data. What is the overall accuracy of the model? Is there any evidence of overfitting? If we increased the number of possible levels in the tree from 3 to 300, would you expect that it would increase or decrease the overfit, and why?

6. (5 points) Run a prediction for each of these sentences and tell me which class they are predicted to belong to. Did you get the results that you expected?
   1. "The seats are very uncomfortable, and the food is cold and stale"
   2. "I had such a great experience. Everyone was very friendly and accommodating"
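
As a reference for the precision and recall comparison asked for in questions 4 and 5, the following is a minimal sketch of how those values follow from confusion-matrix counts when positive is coded as 1 and negative as 0. The labels are made up for illustration and are not drawn from the airline data set, and `confusion_counts` and `summarize` are hypothetical helper names, not functions from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Accuracy plus precision and recall for each of the two classes."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=1)

    def ratio(numerator, denominator):
        # Guard against an empty denominator (e.g. a class never predicted).
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision_pos": ratio(tp, tp + fp),
        "recall_pos": ratio(tp, tp + fn),
        # For the negative class, the roles of the counts are mirrored.
        "precision_neg": ratio(tn, tn + fn),
        "recall_neg": ratio(tn, tn + fp),
    }


# Made-up labels for illustration only: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Even with perfectly balanced classes, as in this toy example, precision and recall can differ between the positive and negative class, which is why the two are worth reporting separately.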