Question:

alldocuments = []
iterate the directories:
    for each file:
        open the file and use readlines() to read all lines
        fetch the lines related to "Subject" and the body, which starts after "Lines: number"
        put all document-related information into a list: ["subdirectory", [list of the subject and body lines]]
        append the list to alldocuments

iterate alldocuments:
    for each document:
        pre-process the lines part and convert them to terms
        append a dictionary of {term: term-frequency} to the document structure
        now you have: ["subdirectory", [list of the subject and body lines], {term: term-frequency, term: term-freq, ...}]

term-dict = {}
term-doc-freq = {}   # document frequency
iterate all documents:
    for each document:
        for each term in the last part, the {term: term-frequency} dictionary:
            if the term is not in term-dict:
                add it to term-dict, with a default feature-id of 0
            if the term is not in term-doc-freq:
                term-doc-freq[term] = 1
            else:
                term-doc-freq[term] += 1

get all keys from term-dict and sort them
feature-id = 0
for each key in the sorted keys:
    term-dict[key] = feature-id
    feature-id += 1

class-dict = {}
create the dictionary mapping subdirectory -> class
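As a concrete illustration, here is a minimal Python sketch of the pipeline above, under two assumptions: the preprocess argument is a placeholder for the tokenize/stopword/stem step described in 1.1 below, and class_dict simply numbers the subdirectories (the grouping of the 20 newsgroups into 6 classes would replace that last line).

    import os
    from collections import Counter

    def load_documents(root_dir):
        # Walk the newsgroup subdirectories and collect
        # ["subdirectory", [subject and body lines]] entries.
        alldocuments = []
        for subdir in sorted(os.listdir(root_dir)):
            subdir_path = os.path.join(root_dir, subdir)
            if not os.path.isdir(subdir_path):
                continue
            for fname in sorted(os.listdir(subdir_path)):
                with open(os.path.join(subdir_path, fname), errors="ignore") as f:
                    all_lines = f.readlines()
                doc_lines, body_count = [], 0
                for line in all_lines:
                    if line.startswith("Subject:"):
                        doc_lines.append(line[len("Subject:"):].strip())
                    elif line.startswith("Lines:"):
                        try:
                            body_count = int(line.split()[1])
                        except (IndexError, ValueError):
                            body_count = 0   # malformed "Lines:" header
                if body_count:
                    # the body is the last body_count lines of the file
                    doc_lines.extend(l.strip() for l in all_lines[-body_count:])
                alldocuments.append([subdir, doc_lines])
        return alldocuments

    def build_dictionaries(alldocuments, preprocess):
        # Append a {term: term-frequency} dict to each document,
        # then assign the sorted terms consecutive feature ids from 0.
        term_doc_freq = {}                        # document frequency per term
        for doc in alldocuments:
            terms = preprocess(" ".join(doc[1]))  # doc[1] = subject and body lines
            doc.append(dict(Counter(terms)))
            for term in set(terms):
                term_doc_freq[term] = term_doc_freq.get(term, 0) + 1
        term_dict = {t: fid for fid, t in enumerate(sorted(term_doc_freq))}
        class_dict = {s: cid for cid, s in
                      enumerate(sorted({d[0] for d in alldocuments}))}
        return term_dict, term_doc_freq, class_dict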

In this task, you will develop the program feature-extract.py, which has the following usage:

    python feature-extract.py directory_of_newsgroups_data feature_definition_file class_definition_file training_data_file

where:
- input: directory_of_newsgroups_data is the directory of the unzipped newsgroups data
- output: feature_definition_file contains (term, feature_id) pairs
- output: class_definition_file contains (class_name, class_id) pairs
- output: training_data_file has a specific format, described below

feature-extract.py contains the following two components, 1.1 and 1.2.

1.1 Document preprocessing. For each document, the preprocessing steps include: split the document into a list of tokens and lowercase the tokens; remove the stopwords, using the provided list of stopwords; apply stemming, using a stemmer in NLTK. When you parse the documents, you may look only at the subject and body. The subject line is indicated by the keyword "Subject:", while the number of lines in the body is indicated by "Lines: xx"; the body is the last xx lines of the document. Some files may be missing the "Lines: xx" header or have other exceptions. Please manually add the "Lines: xx" line for these few files and handle the exceptions appropriately. If you believe other fields of the documents might also be useful, you are free to include them. If you are interested, you may also try other preprocessing steps described in class. At the end of preprocessing, each document is converted to a bag of terms. You can collect all the unique terms, sort them, and create the dictionary.

1.2 Generating the document vector representation. According to the dictionary, you assign each term an integer id - the feature_id, starting from 0. The (term, feature_id) pairs are written to the feature_definition_file. The 20 newsgroups are grouped into 6 classes: (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, ...). A training_data_file is in the libsvm format, i.e., the sparse representation of the document-term matrix (see a sample libsvm file). It is a text file where each row represents one document and takes the form

    class_id feature_id:value feature_id:value ...

This format is especially suitable for sparse datasets: none of the zero-valued features need to be recorded. According to the newsgroup directory that the document belongs to, you can find the class label from the mapping defined in the class_definition_file. Each feature_id refers to the term defined in the feature_definition_file. Depending on the type of values you would like, the feature value can be term frequency (TF), inverse document frequency (IDF), TF-IDF, or Boolean (for Bernoulli Naive Bayes). You can generate all training data files at once after running feature-extract.py, e.g., training_data_file.TF, training_data_file.IDF, training_data_file.TFIDF, and training_data_file.Boolean. Please carefully test the feature values you generate. Your experiments on classification and clustering may select different training data; for example, Bernoulli Naive Bayes classifiers may prefer Boolean training data.

Once you have the training data in the libsvm format, you can easily load it into sklearn as follows:

    from sklearn.datasets import load_svmlight_file
    feature_vectors, targets = load_svmlight_file("/path/to/train_dataset.txt")

Data preprocessing has been one of the most time-consuming and critical steps in data mining. Make sure you thoroughly test your code.
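As an illustration of the preprocessing in 1.1, here is a minimal sketch using NLTK's PorterStemmer. The regex tokenizer and the small STOPWORDS set are placeholder assumptions: the assignment supplies its own stopword list, and any NLTK stemmer would do.

    import re
    from nltk.stem import PorterStemmer

    # Hypothetical stand-in for the assignment's stopword list.
    STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                 "from", "in", "is", "it", "of", "on", "that", "the",
                 "to", "was", "were", "will", "with"}

    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase, keep alphabetic runs as tokens, drop stopwords, stem.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]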

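For 1.2, the sketch below writes one libsvm row per document from the alldocuments, term-dict, term-doc-freq, and class-dict structures built in the pseudocode above (underscore names here). The IDF used, log(N/df), is one common variant; pick whichever definition your experiments call for.

    import math

    def write_libsvm(alldocuments, term_dict, term_doc_freq,
                     class_dict, path, scheme="TF"):
        # One row per document: class_id feature_id:value feature_id:value ...
        n_docs = len(alldocuments)
        with open(path, "w") as out:
            for subdir, _, tf in alldocuments:   # tf = {term: frequency}
                pairs = []
                for term, freq in tf.items():
                    idf = math.log(n_docs / term_doc_freq[term])  # one common IDF variant
                    value = {"TF": freq, "IDF": idf,
                             "TFIDF": freq * idf, "Boolean": 1}[scheme]
                    pairs.append((term_dict[term], value))
                pairs.sort()   # libsvm expects ascending feature ids
                row = " ".join(f"{fid}:{val:g}" for fid, val in pairs)
                out.write(f"{class_dict[subdir]} {row}\n")

    # Generate all four training data files in one run:
    # for scheme in ("TF", "IDF", "TFIDF", "Boolean"):
    #     write_libsvm(alldocuments, term_dict, term_doc_freq, class_dict,
    #                  "training_data_file." + scheme, scheme)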