Question: I have Python code that can run any dataset by just changing the input file name. What I need assistance on is getting the

I have Python code that can run on any dataset just by changing the input file name. What I need assistance with is getting the code to find and detect outliers in the dataset. Thank you.

import pandas as pd
import numpy as np
import re
from unidecode import unidecode
import string
import matplotlib.pyplot as plt
import nltk
from statistics import mean
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support, classification_report, confusion_matrix

DATA = "atla_script.csv"
df = pd.read_csv(DATA)
print('There are {} rows and {} columns in the training dataset.'.format(df.shape[0], df.shape[1]))
df.head()

"""The five columns in this dataset include:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article
- label: a label that marks the article as potentially unreliable
  - 1: unreliable
  - 0: reliable

After completing the preprocessing/cleaning of the data, I will split the data into train and test sets.

## Preprocessing and Cleaning

### Duplicate Data
"""

shape = df.shape[0]
df = df.drop_duplicates(keep="first")
diff = shape - df.shape[0]
if diff > 0:
    print(f"We removed {diff} duplicates from the dataset")
else:
    print("We found no duplicates in the dataset!")

"""### Missing Data"""

print("Missing data: ")
print(df.isnull().sum())

"""We are missing 558 titles, 1957 authors and 39 text entries in the dataset.
Since this is not a significant portion of the data and there is really no efficient way to replace/estimate textual data, I am going to drop any entries that contain a missing value."""

shape = df.shape[0]
# how='any' drops a row if any column is missing; recent pandas versions reject passing thresh together with how
df = df.dropna(axis=0, how='any')
print('Dropped {} entries with missing values'.format(shape - df.shape[0]))
print('After dropping the missing values, there are now {} rows and {} columns in the dataset.'.format(df.shape[0], df.shape[1]))

"""### Cleaning"""

def clean(s):
    s = re.sub(r'<[^\s>]+>', '', s).strip()  # remove any tags
    s = unidecode(s)  # remove accented characters
    s = ''.join(c for c in s if c not in string.punctuation)  # remove punctuation
    s = ' '.join(s.split())  # remove extra whitespace
    return s

#df.text = df.text.apply(lambda u: clean(u))
df.title = df.title.apply(lambda u: clean(u))
print("Cleaned all titles and text!")

"""### Outliers"""
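One possible way to fill in the outlier step is sketched here. It rests on assumptions that are not in the original post: an "outlier" is taken to mean a row whose text is unusually short or long by word count, the fences come from the common 1.5 x IQR rule, and the function name `flag_length_outliers` and the toy DataFrame are hypothetical stand-ins for the real data.

```python
import pandas as pd


def flag_length_outliers(df, column, k=1.5):
    """Return a boolean Series that is True for rows whose word count in
    `column` lies outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # Word count per row; astype(str) guards against non-string values
    lengths = df[column].astype(str).str.split().str.len()
    q1 = lengths.quantile(0.25)
    q3 = lengths.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (lengths < lower) | (lengths > upper)


# Toy data (hypothetical) standing in for the real input file
demo = pd.DataFrame({
    "title": [
        "a b c",
        "a b c d",
        "a b",
        "a b c",
        "a b c d",
        " ".join(["w"] * 50),  # deliberately much longer than the rest
    ]
})

mask = flag_length_outliers(demo, "title")
print(f"Found {mask.sum()} outlier(s) in the dataset")

# Keep only the rows that were not flagged
demo_clean = demo[~mask]
```

With the DataFrame from the question, the same call would be `flag_length_outliers(df, "title")`, assuming the column is named `title`. Word count is only one choice of measure; character count or a numeric column could be passed through the same fences.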
