
Write a function train_classifier(training_set) which takes the single argument training_set, the name of a CSV file as a string, and returns a dictionary of normalised trigram-counts (as dictionaries); that is, the dictionary should have the format {lang1: trigram_counts1, lang2: trigram_counts2, ...}.

The file training_set is of the following form:

lang1,text1
lang2,text2
lang3,text3

noting that there may be more than one document per language.
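As a rough illustration of this file format, the sketch below groups document texts by language using the standard csv module. It assumes every row has exactly two fields and that there is no header row (the question shows none). The name group_documents is hypothetical, and the in-memory sample stands in for a real file.

```python
import csv
import io
from collections import defaultdict


def group_documents(lines):
    """Group document texts by language from an iterable of CSV lines."""
    docs_by_lang = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # assumption: silently skip blank or malformed rows
        lang, text = row
        docs_by_lang[lang].append(text)
    return docs_by_lang


# Made-up sample data; note the quoted field containing a comma.
sample = io.StringIO(
    'English,"Hello, world"\n'
    'Icelandic,Halló heimur\n'
    'English,Good morning\n'
)
grouped = group_documents(sample)
print(dict(grouped))
```

Using csv.reader rather than splitting each line on the first comma means quoted fields containing commas or newlines are still read as a single document.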

For an example, see example_tset.csv, accessible as a tab at the top right. Note that the contents were taken from Wikipedia articles for the different languages. While the individual documents have been automatically stripped of a lot of the document markup, they still include some formatting characters and other noise, which will form part of the trigram counts. Though we won't do anything with it here, dealing with this kind of 'noise' is an important part of the data wrangling step of data science.

We have provided a (hidden) implementation of the function count_trigrams(doc) from the previous question in hidden_lib. This function takes a document (a string) and returns a default dictionary of trigram-counts for the trigrams within the string.
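The hidden implementation is not shown, so the following is only a stand-in sketch of what such a function might look like. It assumes a trigram is any run of three consecutive characters, whitespace included (consistent with the 'g t' key in the example output), and that no case folding or other cleaning is applied.

```python
from collections import defaultdict


def count_trigrams(doc):
    """Stand-in sketch: count overlapping three-character substrings of doc."""
    counts = defaultdict(int)
    # A string of length n has n - 2 trigrams; shorter strings have none.
    for i in range(len(doc) - 2):
        counts[doc[i:i + 3]] += 1
    return counts


counts = count_trigrams("banana")
print(dict(counts))
```

Because the result is a default dictionary, looking up a trigram that never occurred gives 0 rather than raising KeyError.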

Your code should behave as follows:

>>> d = train_classifier('example_tset.csv')
>>> d.keys()
dict_keys(['Indonesian', 'Icelandic', 'English'])
>>> type(d['English'])
<class 'collections.defaultdict'>
>>> d['English']['g t']
0.05794400216170997

Your code will be tested on a hidden training set which is much larger than the example set. It contains 3331 documents from Wikipedias of 74 different languages. Consequently, the hidden test case might take a while to run.

My incomplete code

from collections import defaultdict as dd
import csv
from hidden_lib import count_trigrams
from math import sqrt

def normalise(counts_dict):
    """ normalise takes a dictionary of trigram counts counts_dict
    and normalises it by its length."""
    mag = sqrt(sum([x**2 for x in counts_dict.values()]))
    return dd(int, {key: value/mag for (key, value) in counts_dict.items()})

def train_classifier(training_set):
    """ train_classifier takes a csv filename training_set as a
    string and returns a dictionary of average trigram-counts per
    language. """
    # your code here.
    pass
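One possible way to fill in the stub is sketched below. It is not a confirmed reference solution. It assumes that the counts of all documents for a language are summed first and then normalised once; the docstring's wording ("average") could also be read as normalising each document separately, which would give different numbers. Since hidden_lib is not available outside the assignment, the sketch falls back to a stand-in count_trigrams that counts overlapping three-character substrings (also an assumption). The temporary file and its rows are made up for the demonstration.

```python
import csv
import os
import tempfile
from collections import defaultdict as dd
from math import sqrt

try:
    from hidden_lib import count_trigrams  # provided by the assignment
except ImportError:
    def count_trigrams(doc):
        """Stand-in (assumption): count overlapping 3-character substrings."""
        counts = dd(int)
        for i in range(len(doc) - 2):
            counts[doc[i:i + 3]] += 1
        return counts


def normalise(counts_dict):
    """Scale the counts to unit Euclidean length, as in the question's code."""
    mag = sqrt(sum(x ** 2 for x in counts_dict.values()))
    return dd(int, {key: value / mag for key, value in counts_dict.items()})


def train_classifier(training_set):
    """Return {language: normalised trigram counts} for a CSV of lang,text rows."""
    totals = {}
    with open(training_set, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue  # assumption: skip blank or malformed rows
            lang, text = row
            lang_counts = totals.setdefault(lang, dd(int))
            # Accumulate this document's counts into the language total.
            for trigram, n in count_trigrams(text).items():
                lang_counts[trigram] += n
    return {lang: normalise(counts) for lang, counts in totals.items()}


# Demonstration on a small made-up training file.
rows = [("English", "the cat"), ("English", "the dog"), ("Icelandic", "halló")]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    csv.writer(tmp).writerows(rows)
d = train_classifier(tmp.name)
os.remove(tmp.name)
print(sorted(d))
```

Each language's vector has unit length after normalisation, so the sum of squared values in d['English'] should come out at 1 up to floating-point error.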
