Question:

import urllib.request

sms_corpus = []

with urllib.request.urlopen("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url:
    for line in url.readlines():
        # each line is "label<TAB>text"; strip the trailing newline before splitting
        sms_corpus.append(line.decode().strip().split('\t'))

# print the label and text of document 16
docid = 16
print(sms_corpus[docid])

# print the label of document 16
print(sms_corpus[docid][0])

# print the text of document 16
print(sms_corpus[docid][1])

# TOKENIZER WITH LEMMATIZER & STOPWORDS #

# import the nltk libraries and download the required resources
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def tokenize(doc):
    punctuation = ['.', ',', ';', ':', '!', '?', '(', ')', '{', '}', '"', '\'']
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    # tokenize the document into words
    words = word_tokenize(doc)

    # lowercase, drop punctuation and stopwords, then lemmatize
    words = [lemmatizer.lemmatize(word.lower()) for word in words
             if word not in punctuation and word.lower() not in stop_words]
    return words

# testing the tokenizer on one document's text
docid = 14
print(sms_corpus[docid][0])            # label
print(tokenize(sms_corpus[docid][1]))  # tokenized document

from math import log
log(1)   # sanity check: this is the natural log, and log(1) == 0.0

-----------------------------

QUESTIONS:

Q1: I provided the code above, and I used the code below to calculate scores (computing TF-IDF scores) for every token in the corpus, storing them in a dictionary called token_scores.

My problem is that the scores end up keyed by whole sentences and not by words (check the output). How can I calculate them per word, working with the array? Please explain as well.

token_scores = {}

def calculate_token_scores(tokenized_docs):
    token_scores = {}

    # count the frequency of each token in the corpus
    token_counts = {}
    for doc in tokenized_docs:
        for token in doc:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # calculate the token scores based on their frequency
    total_docs = len(tokenized_docs)
    for token, count in token_counts.items():
        token_scores[token] = log(total_docs / count)
    return token_scores

calculate_token_scores(sms_corpus)
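
The function itself already works token by token; the problem is the argument. sms_corpus holds raw [label, text] pairs, so the inner loop sees exactly two "tokens" per document: the label string and the entire message string. Tokenize each message first, then pass the token lists in. While doing so, it is also worth counting document frequency (how many documents contain the token at least once) rather than total occurrences, since log(N / df) is the usual IDF. A minimal sketch, assuming every corpus line splits cleanly into [label, text] and reusing the tokenize function defined above:

from math import log

# tokenize only the message text, not the label
tokenized_docs = [tokenize(text) for label, text in sms_corpus]

def calculate_token_scores(tokenized_docs):
    token_scores = {}

    # document frequency: set(doc) counts each document at most once per token
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1

    # IDF-style score: rarer tokens get higher weights
    total_docs = len(tokenized_docs)
    for token, count in doc_freq.items():
        token_scores[token] = log(total_docs / count)
    return token_scores

token_scores = calculate_token_scores(tokenized_docs)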

Q2: Which tokens are most predictive of a message being SPAM? Please explain.
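
One common way to answer this with the pieces already built here: count how often each token appears in spam versus ham messages, then rank tokens by the smoothed log-odds ratio log(P(token|spam) / P(token|ham)). Tokens with a large positive score occur far more often in spam than in ham. A rough sketch under those assumptions, reusing sms_corpus and tokenize from above:

from collections import Counter
from math import log

spam_counts, ham_counts = Counter(), Counter()
for label, text in sms_corpus:
    counts = spam_counts if label == 'spam' else ham_counts
    counts.update(tokenize(text))

total_spam = sum(spam_counts.values())
total_ham = sum(ham_counts.values())

def spamminess(token):
    # Laplace (+1) smoothing so tokens unseen in one class don't divide by zero
    p_spam = (spam_counts[token] + 1) / (total_spam + 1)
    p_ham = (ham_counts[token] + 1) / (total_ham + 1)
    return log(p_spam / p_ham)

# print the 20 tokens most associated with SPAM
vocab = set(spam_counts) | set(ham_counts)
for token in sorted(vocab, key=spamminess, reverse=True)[:20]:
    print(token, round(spamminess(token), 2))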
