Question:
Dataset Description:
This dataset contains customer support posts from the biggest brands on Twitter. It is a
modern corpus of posts and replies and is considered a large dataset, useful for studying
natural language processing and conversational models. The dataset is a CSV file
consisting of consumer tweets and the companies' responses. In addition to the user posts, it
also contains metadata such as tweet_id, user_id, the date and time of creation, and other attributes.
A few sample tweets and replies:
1 @sprintcare I have sent several private messages and no one is responding as usual
2 @115712 Please send us a Private Message so that we can further assist you. Just click
'Message' at the top of your profile.
3 @Ask_Spectrum The correct way to do it is via an OCS Account Takeover and email
Consent Form it does not need to be done in a local office
4 @115717 Hello, My apologies for any frustrations or inconvenience. I’d be happy
to look into this for you? ^MG
5 @115713 This is saddening to hear. Please shoot us a DM, so that we can look into this
for you. -KC
NIT6003 Applied Natural Language Processing
Tasks to do:
Write Python code in a Jupyter notebook to perform the following tasks.
1. Load the dataset:
Download the dataset that contains over 1 million posts from the link
https://www.kaggle.com/thoughtvector/customer-support-on-twitter
Load the dataset using the pandas library.
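As a sketch of the loading step: in practice you would call pd.read_csv on the extracted archive (the CSV inside the Kaggle download is named twcs.csv). The tiny in-memory sample below mirrors the dataset's column layout, per the Kaggle data dictionary, so the snippet runs without the 1M-row download; the rows themselves are illustrative.

```python
import io
import pandas as pd

# In practice, after extracting the Kaggle archive:
#     df = pd.read_csv("twcs.csv")
# The in-memory sample below mirrors the dataset's columns so the
# sketch is self-contained; the two rows are illustrative only.
sample = io.StringIO(
    "tweet_id,author_id,inbound,created_at,text,"
    "response_tweet_id,in_response_to_tweet_id\n"
    '1,115712,True,Tue Oct 31 22:10:47 +0000 2017,'
    '"@sprintcare I have sent several private messages",2,\n'
    '2,sprintcare,False,Tue Oct 31 22:11:45 +0000 2017,'
    '"@115712 Please send us a Private Message",,1\n'
)
df = pd.read_csv(sample)
print(df.shape)            # (rows, columns) of the sample
print(df["text"].iloc[0])  # first consumer tweet
```

With the full file, df.shape should report well over one million rows.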
2. Tokenization:
This is the first step in the text pre-processing pipeline: the raw text is split into
small units called tokens.
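A minimal sketch using NLTK's TweetTokenizer, which (unlike a plain word tokenizer) keeps @mentions and #hashtags as single tokens — a sensible choice for this Twitter corpus:

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer needs no extra corpus downloads and preserves
# Twitter-specific tokens such as @mentions and #hashtags.
tokenizer = TweetTokenizer()
tweet = "@sprintcare I have sent several private messages and no one is responding"
tokens = tokenizer.tokenize(tweet)
print(tokens)  # '@sprintcare' stays a single token
```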
3. Lower Casing:
Lowercasing is a common pre-processing technique. You have to convert the input text
into the same casing format. For example, 'review', 'Review' and 'REVIEW' are treated the
same way and can all be converted to 'review'.
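A one-line sketch of this step, shown both for a plain string and vectorised over a pandas Series (the sample values come from the example above):

```python
import pandas as pd

texts = pd.Series(["review", "Review", "REVIEW"])
lowered = texts.str.lower()   # vectorised; same effect as str.lower() per string
print(lowered.tolist())
```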
4. Punctuation Removal:
The punctuation marks to be removed should be chosen carefully depending on the dataset.
string.punctuation in Python contains the following symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
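One way to sketch a selective removal: strip everything in string.punctuation except a `keep` set. Keeping @ and # so mentions and hashtags survive is an assumption about what matters for tweet data, not a requirement of the task.

```python
import string

def remove_punctuation(text, keep="@#"):
    # Keep @ and # so mentions and hashtags survive -- a judgment
    # call for tweet data; adjust `keep` for your own analysis.
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))

print(remove_punctuation("Just click 'Message' at the top of your profile!"))
# -> Just click Message at the top of your profile
```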
5. Removal of stop words:
Stop words such as "the", "a", "an" and so on occur very frequently in the dataset and
do not add much value to the analysis, so they can be removed. Stop word lists are
already compiled for many languages and we can safely use them. For example, the
stop word list for the English language from the nltk package can be printed as shown
below.
import nltk
nltk.download('stopwords')  # one-time download of the stop word corpus
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))
6. Stemming
Stemming involves reducing words to their base form. For example, if the dataset
contains the two words "walks" and "walking", they can both be reduced to the base
form "walk". Several stemming algorithms exist; you can use the Porter or Snowball
stemmer. Justify your choice of the appropriate stemmer.
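A sketch comparing the two stemmers side by side, which can feed the justification the task asks for (Snowball, also known as Porter2, is a later refinement of the Porter algorithm):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # a.k.a. Porter2

for word in ["walks", "walking", "responding"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))
```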
7. Lemmatization
Lemmatization is similar to stemming but takes a more careful approach to removing
inflections. It does not simply chop off inflections but considers the lexical
meaning of the word to obtain the correct base term. For example, geese becomes goose.
8. Removal of emojis, emoticons, and URLs:
As this dataset is related to social media, there could be a lot of emojis (ex., "Hilarious 😂"),
emoticons (ex., "hilarious :)") and URLs. So, remove any emojis, emoticons or URLs from the
data.
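A regex-based sketch of this cleanup. The emoticon and emoji patterns below are illustrative assumptions that cover common cases (smileys, the main pictograph and flag blocks), not an exhaustive catalogue; a dedicated library could be substituted for stricter coverage.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[()\[\]dDpP/\\|]")  # :) ;-) :D etc.
# Common emoji ranges (pictographs, symbols, flags); not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]")

def clean(text):
    for pattern in (URL_RE, EMOJI_RE, EMOTICON_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace

print(clean("hilarious :) 😂 see https://t.co/abc123"))  # -> hilarious see
```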
Requirements:
1. Python code to perform the tasks listed above (1-8) in Jupyter notebook
2. A report that illustrates the dataset, the various pre-processing steps, and the output after each pre-processing technique