Question: I have Python code that can run on any dataset by just changing the input file name. What I need assistance with is getting the code to find and detect outliers in the dataset. Thank you.

import pandas as pd
import numpy as np
import re
from unidecode import unidecode
import string
import matplotlib.pyplot as plt
import nltk
from statistics import mean
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support, classification_report, confusion_matrix

DATA = "atla_script.csv"
df = pd.read_csv(DATA)
print('There are {} rows and {} columns in the training dataset.'.format(df.shape[0], df.shape[1]))
df.head()

"""The five columns in this dataset include:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article
- label: a label that marks the article as potentially unreliable
    - 1: unreliable
    - 0: reliable

After completing the preprocessing/cleaning of the data, I will split the data into train and test sets.

## Preprocessing and Cleaning

### Duplicate Data
"""

shape = df.shape[0]
df = df.drop_duplicates(keep="first")
diff = shape - df.shape[0]
if diff > 0:
    print(f"We removed {diff} duplicates from the dataset")
else:
    print("We found no duplicates in the dataset!")

"""### Missing Data"""

print("Missing data: ")
print(df.isnull().sum())

"""We are missing 558 titles, 1957 authors and 39 text entries in the dataset.
Since this is not a significant portion of the data and there is really no efficient way to replace/estimate textual data, I am going to drop any entries that contain a missing value."""

shape = df.shape[0]
# Pass only `how`: recent pandas versions raise a TypeError when both `how` and `thresh` are given.
df = df.dropna(axis=0, how='any')
print('Dropped {} entries with missing values'.format(shape - df.shape[0]))
print('After dropping the missing values, there are now {} rows and {} columns in the dataset.'.format(df.shape[0], df.shape[1]))

"""### Cleaning"""

def clean(s):
    s = re.sub(r'<[^\s>]+>', '', s).strip()  # remove any tags
    s = unidecode(s)  # remove accented characters
    s = ''.join(c for c in s if c not in string.punctuation)  # remove punctuation
    s = ' '.join(s.split())  # remove extra whitespace
    return s

#df.text = df.text.apply(lambda u: clean(u))
df.title = df.title.apply(clean)
print("Cleaned all titles and text!")

"""### Outliers"""
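The Outliers section of the script is empty, so here is a minimal sketch of one common way to fill it: Tukey's IQR rule, which flags any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Because the columns described in the question are mostly text, the sketch assumes that word count is an acceptable numeric proxy for a text column. The helper names `iqr_outlier_mask` and `flag_outliers`, the `text_cols` parameter, and the demo frame are all hypothetical and are not part of the original code.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def flag_outliers(df: pd.DataFrame, text_cols=(), k: float = 1.5) -> pd.DataFrame:
    """Return a boolean DataFrame with one column per checked feature."""
    # Start from every numeric column already present in the data.
    features = df.select_dtypes(include="number").copy()
    # Assumption: the word count of a text column is a usable numeric proxy.
    for col in text_cols:
        features[f"{col}_word_count"] = df[col].astype(str).str.split().str.len()
    return features.apply(lambda s: iqr_outlier_mask(s, k))


# Made-up rows for illustration only (not the asker's data).
demo = pd.DataFrame({
    "label": [0, 1, 0, 1, 0, 1],
    "title": [
        "short title here",
        "another short title here",
        "one more title",
        "a title with five words",
        "yet another small title",
        " ".join(["word"] * 30),  # deliberately much longer than the rest
    ],
})

mask = flag_outliers(demo, text_cols=["title"])
outlier_rows = demo[mask.any(axis=1)]  # rows flagged in at least one feature
print(outlier_rows.index.tolist())
```

In the script itself, the call would presumably look like `flag_outliers(df.drop(columns=["id"]), text_cols=["title", "text"])`, since an identifier column is not a meaningful feature to test. Whether flagged rows should be dropped or only inspected depends on the dataset, so the sketch returns the mask and does not modify `df`.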
