Question:

alldocuments = []
iterate the directories:
    for each file:
        open the file and use readlines() to read all lines
        fetch the lines related to "Subject" and the body, which starts after "Lines: number"
        put all document-related information into a list: ["subdirectory", [list of the subject and body lines]]
        append the list to alldocuments

iterate alldocuments:
    for each document:
        pre-process the lines part and convert them to terms
        append a dictionary of {term: term-frequency} to the document structure
        now you have: ["subdirectory", [list of the subject and body lines], {term: term-frequency, term: term-freq, ...}]

term-dict = {}
term-doc-freq = {}   # document frequency
iterate all documents:
    for each document:
        for each term in the last part, the {term: term-frequency} dictionary:
            if the term is not in term-dict:
                add it to term-dict, with a default feature-id of 0
            if the term is not in term-doc-freq:
                term-doc-freq[term] = 1
            else:
                term-doc-freq[term] += 1

get all keys from term-dict and sort them
feature-id = 0
for each key in the sorted keys:
    term-dict[key] = feature-id
    feature-id += 1

class-dict = {}
create the dictionary mapping subdirectory -> class
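As a concrete illustration, here is a minimal Python sketch of the pipeline above, under two assumptions: the preprocess argument is a placeholder for the tokenize/stopword/stem step described in 1.1 below, and class_dict simply numbers the subdirectories (the grouping of the 20 newsgroups into 6 classes would replace that last line).

    import os
    from collections import Counter

    def load_documents(root_dir):
        # Walk the newsgroup subdirectories and collect
        # ["subdirectory", [subject and body lines]] entries.
        alldocuments = []
        for subdir in sorted(os.listdir(root_dir)):
            subdir_path = os.path.join(root_dir, subdir)
            if not os.path.isdir(subdir_path):
                continue
            for fname in sorted(os.listdir(subdir_path)):
                with open(os.path.join(subdir_path, fname), errors="ignore") as f:
                    all_lines = f.readlines()
                doc_lines, body_count = [], 0
                for line in all_lines:
                    if line.startswith("Subject:"):
                        doc_lines.append(line[len("Subject:"):].strip())
                    elif line.startswith("Lines:"):
                        try:
                            body_count = int(line.split()[1])
                        except (IndexError, ValueError):
                            body_count = 0   # malformed "Lines:" header
                if body_count:
                    # the body is the last body_count lines of the file
                    doc_lines.extend(l.strip() for l in all_lines[-body_count:])
                alldocuments.append([subdir, doc_lines])
        return alldocuments

    def build_dictionaries(alldocuments, preprocess):
        # Append a {term: term-frequency} dict to each document,
        # then assign the sorted terms consecutive feature ids from 0.
        term_doc_freq = {}                        # document frequency per term
        for doc in alldocuments:
            terms = preprocess(" ".join(doc[1]))  # doc[1] = subject and body lines
            doc.append(dict(Counter(terms)))
            for term in set(terms):
                term_doc_freq[term] = term_doc_freq.get(term, 0) + 1
        term_dict = {t: fid for fid, t in enumerate(sorted(term_doc_freq))}
        class_dict = {s: cid for cid, s in
                      enumerate(sorted({d[0] for d in alldocuments}))}
        return term_dict, term_doc_freq, class_dict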

In this task, you will develop the program feature-extract.py, which has the following usage:

    python feature-extract.py directory_of_newsgroups_data feature_definition_file class_definition_file training_data_file

where:
- input: directory_of_newsgroups_data is the directory of the unzipped newsgroups data
- output: feature_definition_file contains (term, feature_id) pairs
- output: class_definition_file contains (class_name, class_id) pairs
- output: training_data_file has a specific format, described below

feature-extract.py contains the following two components, 1.1 and 1.2.

1.1 Document preprocessing. For each document, the preprocessing steps include: split the document into a list of tokens and lowercase the tokens; remove the stopwords, using the provided list of stopwords; apply stemming, using a stemmer in NLTK. When you parse the documents, you may look only at the subject and body. The subject line is indicated by the keyword "Subject:", while the number of lines in the body is indicated by "Lines: xx"; the body is the last xx lines of the document. Some files may be missing the "Lines: xx" header or have other exceptions. Please manually add the "Lines: xx" line for these few files and handle the exceptions appropriately. If you believe other fields of the documents might also be useful, you are free to include them. If you are interested, you may also try other preprocessing steps described in class. At the end of preprocessing, each document is converted to a bag of terms. You can collect all the unique terms, sort them, and create the dictionary.

1.2 Generating the document vector representation. According to the dictionary, you assign each term an integer id - the feature_id, starting from 0. The (term, feature_id) pairs are written to the feature_definition_file. The 20 newsgroups are grouped into 6 classes: (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, ...). A training_data_file is in the libsvm format, i.e., the sparse representation of the document-term matrix (see a sample libsvm file). It is a text file where each row represents one document and takes the form

    class_id feature_id:value feature_id:value ...

This format is especially suitable for sparse datasets: none of the zero-valued features need to be recorded. According to the newsgroup directory that the document belongs to, you can find the class label from the mapping defined in the class_definition_file. Each feature_id refers to the term defined in the feature_definition_file. Depending on the type of values you would like, the feature value can be term frequency (TF), inverse document frequency (IDF), TF-IDF, or Boolean (for Bernoulli Naive Bayes). You can generate all training data files at once after running feature-extract.py, e.g., training_data_file.TF, training_data_file.IDF, training_data_file.TFIDF, and training_data_file.Boolean. Please carefully test the feature values you generate. Your experiments on classification and clustering may select different training data; for example, Bernoulli Naive Bayes classifiers may prefer Boolean training data.

Once you have the training data in the libsvm format, you can easily load it into sklearn as follows:

    from sklearn.datasets import load_svmlight_file
    feature_vectors, targets = load_svmlight_file("/path/to/train_dataset.txt")

Data preprocessing has been one of the most time-consuming and critical steps in data mining. Make sure you thoroughly test your code.
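As an illustration of the preprocessing in 1.1, here is a minimal sketch using NLTK's PorterStemmer. The regex tokenizer and the small STOPWORDS set are placeholder assumptions: the assignment supplies its own stopword list, and any NLTK stemmer would do.

    import re
    from nltk.stem import PorterStemmer

    # Hypothetical stand-in for the assignment's stopword list.
    STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                 "from", "in", "is", "it", "of", "on", "that", "the",
                 "to", "was", "were", "will", "with"}

    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase, keep alphabetic runs as tokens, drop stopwords, stem.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]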

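For 1.2, the sketch below writes one libsvm row per document from the alldocuments, term-dict, term-doc-freq, and class-dict structures built in the pseudocode above (underscore names here). The IDF used, log(N/df), is one common variant; pick whichever definition your experiments call for.

    import math

    def write_libsvm(alldocuments, term_dict, term_doc_freq,
                     class_dict, path, scheme="TF"):
        # One row per document: class_id feature_id:value feature_id:value ...
        n_docs = len(alldocuments)
        with open(path, "w") as out:
            for subdir, _, tf in alldocuments:   # tf = {term: frequency}
                pairs = []
                for term, freq in tf.items():
                    idf = math.log(n_docs / term_doc_freq[term])  # one common IDF variant
                    value = {"TF": freq, "IDF": idf,
                             "TFIDF": freq * idf, "Boolean": 1}[scheme]
                    pairs.append((term_dict[term], value))
                pairs.sort()   # libsvm expects ascending feature ids
                row = " ".join(f"{fid}:{val:g}" for fid, val in pairs)
                out.write(f"{class_dict[subdir]} {row}\n")

    # Generate all four training data files in one run:
    # for scheme in ("TF", "IDF", "TFIDF", "Boolean"):
    #     write_libsvm(alldocuments, term_dict, term_doc_freq, class_dict,
    #                  "training_data_file." + scheme, scheme)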