Question:

You will only be using the u.data file. Within that file you will need the first three columns, which contain the user id, movie id, and movie rating respectively. You can ignore the fourth column, which contains a timestamp.
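As a rough illustration of reading just those three columns, the sketch below parses a ratings file line by line. It assumes the fields are separated by whitespace (tabs or spaces); the function name load_ratings is hypothetical and not part of the assignment.

```python
def load_ratings(path):
    """Return a list of (user_id, movie_id, rating) tuples from a ratings file."""
    ratings = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # splits on tabs or spaces
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            user_id = int(fields[0])
            movie_id = int(fields[1])
            rating = float(fields[2])
            # Any fourth column (the timestamp) is deliberately ignored.
            ratings.append((user_id, movie_id, rating))
    return ratings
```

The same function can be pointed at either the real file or a small hand-made test file, since both share the format.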
Use your favorite text editor to manually create a small dataset with around 3 movies and 10 users. Put the data into a file called test.data using the same format used for the real data. You can make up a timestamp by using 0 or similar.
To Turn In: In your README, please briefly answer the following questions:
1. What were your considerations when creating this test data?
2. Were there certain characteristics of the real data and file format that you made sure to capture in your test data?
3. Did you create a reference solution for your test data? If so, how?
In a file called similarity.py, write a program that computes, for each movie, the movie in the dataset that it is most similar to (i.e., the movie with a similarity score closest to 1.0). Write the output to a CSV file with columns for the base movie id, the movie id it is most similar to, the similarity score, and the number of common ratings between the two movies. If a movie does not have a match with enough common ratings, then you can set certain entries in the DataFrame to NaN. Your code should roughly look like:
def
def
def compute_similarity(input_file, output_file, user_threshold=5):
    """
    Function to compute similarity scores

    Arguments
    input_file: str, path to input MovieLens file
    output_file: str, path to output .csv
    user_threshold: int, optional argument to specify the minimum number
        of common users between movies to compute a similarity score.
        The default value should be 5.
    """

if __name__ == "__main__":
    input_file = "path/to/input"
    output_file = "path/to/output"
    compute_similarity(input_file, output_file)
Developing your programs using functions will make them more concise, easier to develop, test, and debug, and will produce components that can easily be reused in other similar programs. Think about these considerations and implement your program in terms of at least two functions. These functions are indicated by the def lines in the code block above, and they should be called from the compute_similarity function, which is what a user would invoke to compute the similarity scores. The block that starts with
    if __name__ == "__main__":
simply says: if this script is run directly, then run this block too. If elements of this script (like functions) are imported by another program, then don't run this block.
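One way the output step might be split into small helper functions is sketched below. This is only an illustration: the helper names, the CSV column names, and the choice to set all three result entries to NaN when no candidate has enough common ratings are assumptions, not requirements from the assignment. The similarity scores themselves are taken as given inputs here.

```python
import csv
import math


def pick_most_similar(candidates, user_threshold=5):
    """Pick the candidate whose score is closest to 1.0.

    candidates: iterable of (other_movie_id, score, n_common) tuples.
    Candidates with fewer than user_threshold common ratings are ignored.
    Returns (nan, nan, nan) when nothing qualifies (an assumed convention).
    """
    eligible = [c for c in candidates if c[2] >= user_threshold]
    if not eligible:
        return (math.nan, math.nan, math.nan)
    return min(eligible, key=lambda c: abs(1.0 - c[1]))


def write_results(rows, output_file):
    """Write (base_id, most_similar_id, score, n_common) rows to a CSV file."""
    with open(output_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Column names are hypothetical; the assignment only lists their meaning.
        writer.writerow(
            ["base_movie_id", "most_similar_movie_id", "similarity", "common_ratings"]
        )
        writer.writerows(rows)
```

Keeping the selection logic separate from the file writing makes each piece easy to check against a hand-computed answer for test.data.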
Your program should be designed in such a way that it can efficiently handle the use of any integer movie and user ids, even if the ids are not consecutive and there are large gaps in the numbers.
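A minimal sketch of one way to cope with sparse, non-consecutive ids: map each distinct id to a compact position, so that storage scales with the number of distinct ids rather than with the largest id value. The function name build_index and the sample ids are hypothetical.

```python
def build_index(ids):
    """Map arbitrary integer ids to consecutive positions 0..n-1."""
    return {original: position for position, original in enumerate(sorted(set(ids)))}


# Duplicates collapse and large gaps in the numbering do not matter.
movie_index = build_index([1682, 7, 7, 50000, 42])
# movie_index == {7: 0, 42: 1, 1682: 2, 50000: 3}
```

The reverse mapping (position back to original id) can be recovered from the sorted id list when writing the output.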
Some hints and guidance:
* The easiest way to write your program is with many nested for-loops, extracting data from the movie DataFrame during each iteration. But that solution will take too long to run. Still, it might help to write this simple-but-slow algorithm first to make sure you understand the task, and test it out on your test.data. For your final algorithm, you should aim for something that runs in less than 1 minute on the full dataset.
* When optimizing your algorithm, you may want to modify the way the data is stored. Key-value lookups are very fast, so a dictionary might be helpful.
* Instead of (or in addition to) a dictionary, a helpful data structure might be a matrix (numpy array) with the M rows representing the movies and the N columns representing the users, with the entry at row i and column j representing the rating given to movie i by user j.
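The two storage hints above can be sketched as follows. The rating triples are made up for illustration, and using NaN to mark a missing rating is an assumption. The sketch only counts common ratings between movies; it does not compute a similarity score.

```python
from collections import defaultdict

import numpy as np

# Hypothetical (user_id, movie_id, rating) triples with gaps in the ids.
ratings = [
    (10, 100, 4), (10, 205, 5),
    (33, 100, 3), (33, 205, 2), (33, 999, 1),
    (71, 205, 4), (71, 999, 5),
]

# Dictionary layout: movie id -> {user id: rating}.
by_movie = defaultdict(dict)
for user_id, movie_id, rating in ratings:
    by_movie[movie_id][user_id] = rating

# Users who rated both movies come from intersecting the key sets.
common_users = by_movie[100].keys() & by_movie[205].keys()

# Matrix layout: M rows (movies) by N columns (users), NaN where no rating exists.
movie_ids = sorted(by_movie)
user_ids = sorted({user_id for user_id, _, _ in ratings})
row_of = {movie_id: row for row, movie_id in enumerate(movie_ids)}
col_of = {user_id: col for col, user_id in enumerate(user_ids)}

matrix = np.full((len(movie_ids), len(user_ids)), np.nan)
for user_id, movie_id, rating in ratings:
    matrix[row_of[movie_id], col_of[user_id]] = rating

# Entry (i, k) counts the users who rated both movie i and movie k.
rated = (~np.isnan(matrix)).astype(int)
common_counts = rated @ rated.T
```

The matrix product replaces an explicit loop over every pair of movies, which is where the nested for-loop version tends to spend most of its time.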
To Turn In: Make sure your repository contains the following:
1. similarity.py with your algorithm implementation
2. test.data
3. A CSV containing the first 20 lines from your movie similarity output file. You can create this abridged DataFrame using df_abridged = df.head(20).
