Question: You will only be using the u.data file. Within that file you will need the first three columns containing the user id, movie id, and movie rating respectively, and can ignore the fourth column that contains a timestamp.

Use your favorite text editor to manually create a small dataset with around 3 movies and 10 users. Put the data into a file called test.data using the same format used for the real data. You can make up a timestamp by using 0 or similar.
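As a rough illustration of the file layout described above, the sketch below reads a few made-up rows and keeps only the first three columns. It assumes pandas is installed and that the columns are tab-separated, as they are in the MovieLens u.data file; the column names and the sample values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows in the u.data layout:
# user id, movie id, rating, timestamp (tab-separated, timestamp set to 0).
sample = "1\t10\t5\t0\n1\t20\t3\t0\n2\t10\t4\t0\n2\t30\t2\t0\n"

ratings = pd.read_csv(
    io.StringIO(sample),  # swap in a file path such as "test.data"
    sep="\t",
    header=None,          # the file has no header row
    names=["user_id", "movie_id", "rating", "timestamp"],
    usecols=["user_id", "movie_id", "rating"],  # ignore the timestamp
)
print(ratings)
```

Reading test.data and the real file through the same call is one way to check that the hand-written file really follows the same format.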

To Turn In: In your README, please briefly answer the following questions:

1. What were your considerations when creating this test data?

2. Were there certain characteristics of the real data and file format that you made sure to capture in your test data?

3. Did you create a reference solution for your test data? If so, how?

In a file called similarity.py, write a program that computes, for each movie, the movie in the dataset that it is most similar to (i.e., the movie with a similarity score closest to 1.0). Write the output to a csv file with columns for the base movie id, the movie id it is most similar to, the similarity score, and the number of common ratings between the two movies. If a movie does not have a match with enough common ratings, then you can set certain entries in the DataFrame to NaN. Your code should roughly look like:

    def ...

    def compute_similarity(input_file, output_file, user_threshold=5):
        """
        Function to compute similarity scores

        Arguments
        ---------
        input_file: str, path to input MovieLens file
        output_file: str, path to output.csv
        user_threshold: int, optional argument to specify
            the minimum number of common users between movies
            to compute a similarity score. The default value
            should be 5.
        """

    if __name__ == "__main__":
        input_file = "path//input"
        output_file = "path//output"
        compute_similarity(input_file, output_file)

Developing your programs using functions will make them more concise, easier to develop/test/debug, and result in components of the original code that can easily be reused in other similar programs. Think about these considerations and implement your program in terms of at least two functions. These functions are indicated by the def in the code block above, and they should be called from the compute_similarity function, which is what a user would invoke to compute the similarity scores. The block that starts with if __name__ == "__main__" simply says: if this script is run directly, then run this block too. If elements of this script (like functions) are imported by another program, then don't run this block.

Your program should be designed in such a way that it can efficiently handle the use of any integer movie and user ids, even if the ids are not consecutive and there are large gaps in the numbers.
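One possible way to cope with non-consecutive ids is to map each distinct id to a compact position, so that storage depends on how many ids exist and not on how large they are. The sketch below uses NumPy's `np.unique` with `return_inverse=True`; the id values and the names `movie_ids` and `index_of` are made up for illustration.

```python
import numpy as np

# Hypothetical movie ids with large gaps between them.
movie_ids = np.array([7, 1000, 7, 52, 1000, 999999])

# unique_ids holds the sorted distinct ids;
# positions[k] is the index of movie_ids[k] within unique_ids.
unique_ids, positions = np.unique(movie_ids, return_inverse=True)

# Dictionary from an original id to its compact index.
index_of = {int(movie_id): i for i, movie_id in enumerate(unique_ids)}

print(unique_ids)  # the distinct ids, sorted
print(positions)   # one compact index per original entry
```

The compact indices can then serve as dictionary keys or as row and column positions, and `unique_ids` converts them back to the original ids when writing the output.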

Some hints and guidance:

* The easiest way to write your program is with many nested for-loops, extracting data from the movie DataFrame during each iteration. But your solution will take too long to run. Still, it might help to write this simple-but-slow algorithm first to make sure you understand the task, and test it out on your test.data. For your final algorithm, you should aim for something that runs in less than 1 minute on the full dataset.

* When optimizing your algorithm, you may want to modify the way the data is stored. Key-value lookups are very fast, so a dictionary might be helpful.

* Instead of (or in addition to) a dictionary, a helpful data structure might be a matrix (numpy array) with the M rows representing the movies, and N columns representing the users, with the entry at row i and column j representing the rating given to movie i by user j.
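The dictionary hint can be sketched roughly as follows. This is not a reference solution: the excerpt does not define the similarity measure, so the sketch assumes the Pearson correlation over the users two movies have in common, and treats the highest score as the one closest to 1.0. The function names `build_lookup`, `pearson`, and `most_similar` are hypothetical, and no claim is made about running time on the full dataset.

```python
import numpy as np


def build_lookup(rows):
    """Group (user_id, movie_id, rating) rows into {movie_id: {user_id: rating}}."""
    lookup = {}
    for user_id, movie_id, rating in rows:
        lookup.setdefault(movie_id, {})[user_id] = rating
    return lookup


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (assumed metric).

    Returns NaN when either sequence is constant.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return float((da * db).sum() / denom) if denom > 0 else float("nan")


def most_similar(lookup, user_threshold=5):
    """Map each movie to (best_match, score, n_common); NaNs if nothing qualifies."""
    nan = float("nan")
    result = {}
    for base, base_ratings in lookup.items():
        best = (nan, nan, nan)
        for other, other_ratings in lookup.items():
            if other == base:
                continue
            # Users who rated both movies: a fast set intersection on dict keys.
            common = base_ratings.keys() & other_ratings.keys()
            if len(common) < user_threshold:
                continue
            score = pearson(
                [base_ratings[u] for u in common],
                [other_ratings[u] for u in common],
            )
            if not np.isnan(score) and (np.isnan(best[1]) or score > best[1]):
                best = (other, score, len(common))
        result[base] = best
    return result
```

Because the dictionary keys are the ids themselves, gaps in the numbering cost nothing. The matrix layout from the last hint would replace the inner lookups with array operations, at the price of first mapping ids to row and column positions.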

To Turn In: Make sure your repository contains the following:

1. similarity.py with your algorithm implementation

2. test.data

3. A CSV containing the first 20 lines from your movie similarity output file. You can create this abridged DataFrame using df_abridged = .head(20).
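A minimal sketch of producing the abridged CSV, assuming the results are held in a pandas DataFrame. The DataFrame name `df`, the column names, and the values are hypothetical; only the four-column layout and the use of NaN for movies without a match come from the task description.

```python
import numpy as np
import pandas as pd

# Hypothetical results: one row per base movie.
# NaN marks a movie with no match that has enough common ratings.
df = pd.DataFrame(
    {
        "base_movie_id": [10, 20, 40],
        "most_similar_movie_id": [20, 10, np.nan],
        "similarity_score": [1.0, 1.0, np.nan],
        "common_ratings": [3, 3, np.nan],
    }
)

# head(20) keeps the first 20 rows, or all rows if there are fewer.
df_abridged = df.head(20)

# With no path argument, to_csv returns the CSV text as a string;
# pass a file name instead to write it to disk.
csv_text = df_abridged.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the four requested columns.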
