Question: Part Two Review the airline sentiment data set included with this exam. This data set is saved in an Excel spreadsheet and contains 4 columns


Part Two

Review the airline sentiment data set included with this exam. This data set is saved in an Excel spreadsheet and contains 4 columns (ID, sentiment, airline, and text). Sentiment can be either positive, negative, or neutral. Answer the following questions. You may need to do some analysis with Python on the data set to answer some of these questions. Some of the sections below are tasks instead of questions. If the question is labelled as a task, perform the task. There will not be a question to answer for these tasks. Use the tools available for these tasks. For example, sometimes it is easier to accomplish something in Excel rather than writing code to do the task.

1. (5 points) The data set is somewhat imbalanced. There are fewer positive tweets and many more negative tweets. Explain how this might impact your ability to create analytical models and how you might be able to address the problem.

Task: Create a new data set that consists of the first 2000 tweets in each of the three categories (positive, neutral, negative). The data set should have only two columns, which are the tweet and the sentiment. Make sure that you use the first 2000 ID numbers in each category to make sure that we are all using the same data set. Remove the tag with the airline name from the text of these tweets so that the tweet does not have an identifier for the airline. This can all be done in Excel quite easily.

Task: Load the data set into a Pandas data frame. Do all necessary preprocessing and vectorization to prepare the data for analysis. Make sure that there are no non-alpha characters in the data result. Use a snowball stemmer to stem the output and use a TFIDF vectorizer that has no more than 500 features. Each token (stemmed word) should be in at least 10 documents, but no more than 60% of documents in the entire corpus should contain any specific token. Create a 3-node cluster with the data set using the tokens, but not the sentiment tag.

2. (5 points) How many tweets are in each of the 3 clusters?

3. (15 points) What are the centroid values for the first token in each of the three clusters? Note that since the clusters are all built from the same data structure, the first centroid position for all three clusters should represent the same token. What does this centroid value represent (in other words, can you relate this value back to the work that we did in the data set creation and vectorization process and tell me what this value is)? Can you briefly describe how this centroid is calculated?

Task: Drop the 2000 neutral tweets from the original data set, so that you are left with 4000 tweets (2000 positive and 2000 negative). Dummy code sentiment with positive as 1 and negative as 0. You will use the dummy coded version as your classification target. Redo the preprocessing and vectorization using the same hyperparameters. Since the corpus has changed, your vectorization result will be a bit different. Create a classification model using a decision tree with no more than three levels in the tree. Split the data as 75% training and 25% testing, making sure that each partition still has 50% of its content as positive and 50% as negative.

4. (10 points) Report the confusion matrix for the model predictions on the training data. What is the overall accuracy of the model? Are there differences between the recall and precision values for positive versus negative?

5. (10 points) Report the confusion matrix for the model predictions on the testing data. What is the overall accuracy of the model? Is there any evidence of overfitting? If we increased the number of possible levels in the tree from 3 to 300, would you expect that it would increase or decrease the overfit and why?

6. (5 points) Run a prediction for each of these sentences and tell me which class they are predicted to belong to. Did you get the results that you expected?
   1. "The seats are very uncomfortable, and the food is cold and stale"
   2. "I had such a great experience. Everyone was very friendly and accommodating"

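The classification task (decision tree, stratified 75/25 split, confusion matrices for questions 4 and 5) might be sketched as below. This is an illustration under assumptions: it reads "no more than three levels" as `max_depth=3`, although some courses count the root as a level. The helper names `fit_tree` and `report` and the seed are hypothetical. `X` stands for the TF-IDF matrix of the 4000 positive and negative tweets and `y` for the dummy-coded sentiment (positive as 1, negative as 0).

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_tree(X, y, seed=42):
    """Stratified 75/25 split, then a decision tree capped at depth 3."""
    # stratify=y keeps the 50/50 class balance in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Assumption: "three levels" is interpreted as max_depth=3.
    tree = DecisionTreeClassifier(max_depth=3, random_state=seed)
    tree.fit(X_train, y_train)
    return tree, (X_train, y_train), (X_test, y_test)


def report(tree, X, y):
    """Confusion matrix plus accuracy and per-class precision and recall."""
    # Rows are true classes, columns are predicted classes, ordered 0 then 1.
    cm = confusion_matrix(y, tree.predict(X), labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    return {
        "confusion_matrix": cm,
        "accuracy": (tp + tn) / cm.sum(),
        "precision_pos": tp / (tp + fp) if tp + fp else 0.0,
        "recall_pos": tp / (tp + fn) if tp + fn else 0.0,
        "precision_neg": tn / (tn + fn) if tn + fn else 0.0,
        "recall_neg": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Calling `report` once on the training partition and once on the testing partition gives the numbers to compare when looking for overfitting: a training accuracy well above the testing accuracy would be the usual sign. For question 6, each new sentence would be passed through the `transform` method of the TF-IDF vectorizer fitted on the 4000-tweet corpus before calling `tree.predict`.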
