Question: Write a function train_classifier(training_set) that takes a single argument, training_set, the name of a CSV file as a string, and returns a dictionary of normalised trigram counts (themselves dictionaries). That is, the returned dictionary should have the format {lang1: trigram_counts1, lang2: trigram_counts2, ...}.

The file training_set is of the following form:

lang1,text1
lang2,text2
lang3,text3

noting that there may be more than one document per language.
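To make the file format concrete, here is a minimal sketch that writes a few rows in the lang,text form and reads them back with Python's csv module, grouping documents by language. The sample rows, the temporary file, and the UTF-8 encoding are assumptions for illustration only; they are not taken from example_tset.csv.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical sample rows in the lang,text form described above.
rows = [
    ("English", "the cat sat"),
    ("Icelandic", "hundurinn svaf"),
    ("English", "a dog, a cat"),  # a comma inside the text gets quoted by csv
]

# Write the rows to a temporary CSV file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="",
                                 delete=False, encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
    path = f.name

# Read the file back, collecting every document under its language.
docs_by_lang = defaultdict(list)
with open(path, newline="", encoding="utf-8") as f:
    for lang, text in csv.reader(f):
        docs_by_lang[lang].append(text)

os.remove(path)
print(dict(docs_by_lang))
```

Because a language can appear on several rows, anything computed per language has to be accumulated across rows rather than overwritten.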

For an example, see example_tset.csv, accessible as a tab at the top right. Note that the contents were taken from Wikipedia articles for the different languages. While the individual documents have been automatically stripped of a lot of the document markup, they still include some formatting characters and other noise, which will form part of the trigram counts. Though we won't do anything with it here, dealing with this kind of 'noise' is an important part of the data wrangling step of data science.

We have provided a (hidden) implementation of the function count_trigrams(doc) from the previous question in hidden_lib. This function takes a document (a string) and returns a default dictionary of trigram-counts for the trigrams within the string.
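Since hidden_lib is not available outside the course environment, a stand-in can help with local testing. The sketch below assumes a trigram is any run of three consecutive characters, spaces included, which is consistent with the key 'g t' in the expected output; the real hidden implementation may differ in details.

```python
from collections import defaultdict


def count_trigrams(doc):
    """Stand-in for the hidden count_trigrams (assumed behaviour):
    count every overlapping run of three consecutive characters in doc."""
    counts = defaultdict(int)
    for i in range(len(doc) - 2):
        counts[doc[i:i + 3]] += 1
    return counts


counts = count_trigrams("banana")
print(dict(counts))  # {'ban': 1, 'ana': 2, 'nan': 1}
```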

Your code should behave as follows:

>>> d = train_classifier('example_tset.csv')
>>> d.keys()
dict_keys(['Indonesian', 'Icelandic', 'English'])
>>> type(d['English'])
<class 'collections.defaultdict'>
>>> d['English']['g t']
0.05794400216170997

Your code will be tested on a hidden training set which is much larger than the example set. It contains 3331 documents from Wikipedias of 74 different languages. Consequently, the hidden test case might take a while to run.

My incomplete code

from collections import defaultdict as dd
import csv
from hidden_lib import count_trigrams
from math import sqrt


def normalise(counts_dict):
    """ normalise takes a dictionary of trigram counts counts_dict and
    normalises it by its length."""
    mag = sqrt(sum([x**2 for x in counts_dict.values()]))
    return dd(int, {key: value/mag for (key, value) in counts_dict.items()})


def train_classifier(training_set):
    """ train_classifier takes a csv filename training_set as a string and
    returns a dictionary of average trigram-counts per language. """
    # your code here.
    pass
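One possible way to complete train_classifier is sketched below; it is not an official answer. It assumes that the raw trigram counts of all documents in a language are summed first and normalised once at the end. That gives the same result as normalising the per-language average, because scaling a vector does not change its normalised form. The UTF-8 encoding, the skipping of malformed rows, and the fallback count_trigrams (used only when hidden_lib cannot be imported) are further assumptions.

```python
import csv
from collections import defaultdict as dd
from math import sqrt

try:
    from hidden_lib import count_trigrams  # provided by the assignment
except ImportError:
    def count_trigrams(doc):
        """Assumed stand-in: counts of overlapping 3-character substrings."""
        counts = dd(int)
        for i in range(len(doc) - 2):
            counts[doc[i:i + 3]] += 1
        return counts


def normalise(counts_dict):
    """Scale the counts so the vector has unit Euclidean length."""
    mag = sqrt(sum(x ** 2 for x in counts_dict.values()))
    return dd(int, {key: value / mag for key, value in counts_dict.items()})


def train_classifier(training_set):
    """Return {language: normalised trigram counts} for a lang,text CSV."""
    totals = {}
    with open(training_set, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines (assumption)
            lang, text = row[0], row[1]
            # Accumulate raw counts across every document of this language.
            lang_counts = totals.setdefault(lang, dd(int))
            for trigram, n in count_trigrams(text).items():
                lang_counts[trigram] += n
    # Normalise once per language, after all documents have been counted.
    return {lang: normalise(counts) for lang, counts in totals.items()}
```

Calling train_classifier('example_tset.csv') should then return one defaultdict per language. Whether the values match the expected output exactly depends on the hidden count_trigrams behaving like the stand-in.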
