Question: What should be the key and value for mapper's input? What should mapper output? What should be the key and value for reducer's input? What
What should be the key and value for mapper's input? What should mapper output? What should be the key and value for reducer's input? What should reducer output?
In this problem, you need to write a MapReduce program to find the top-N pairs of similar users from a repository called Movielens. Provided is the dataset of Movielens 20M, which can be found in http://grouplens.org/datasets/movielens/20m/. The original range of the rating scores is in [0.5,5.0]. For simplicity, we ignore the missing values in the ratings. You need to calculate the top-N pairs of similar users based on the ratings. In order to calculate the similarity, we redefine the rating score in [0.5,2.5] as unlike", and the rating score in (2.5,5.0] aslike. Thus you can preprocess the dataset as (user, movie, like') or (user, movie, 'unlike'). Given a user i, you can construct a set of movies with 'like' (denoted by Li), and also a set of movies with unlike' (denoted by Ui). Then, for a pair of users of (i,j), you can calculate the similarity between them via the following metric as J accard = (Liu Lj) U (Ui U Uj) You output the top-N pairs of similar users based on the values of Jaccard. (Hint: You can leave the final sorting step to a UNIX utility called "sor". Your MapReduce program needs only to calculate the metric of Jaccard.) Please write a map reduce program (one mapper and one reducer) to list the top-100 pairs of similar users with similarity (from most similarity to least similarity) Each line should have a pair of users' ID and the similarity. A valid example is as follows: "2" "3" 0.75 In this problem, you need to write a MapReduce program to find the top-N pairs of similar users from a repository called Movielens. Provided is the dataset of Movielens 20M, which can be found in http://grouplens.org/datasets/movielens/20m/. The original range of the rating scores is in [0.5,5.0]. For simplicity, we ignore the missing values in the ratings. You need to calculate the top-N pairs of similar users based on the ratings. In order to calculate the similarity, we redefine the rating score in [0.5,2.5] as unlike", and the rating score in (2.5,5.0] aslike. Thus you can preprocess the dataset as (user, movie, like') or (user, movie, 'unlike'). Given a user i, you can construct a set of movies with 'like' (denoted by Li), and also a set of movies with unlike' (denoted by Ui). Then, for a pair of users of (i,j), you can calculate the similarity between them via the following metric as J accard = (Liu Lj) U (Ui U Uj) You output the top-N pairs of similar users based on the values of Jaccard. (Hint: You can leave the final sorting step to a UNIX utility called "sor". Your MapReduce program needs only to calculate the metric of Jaccard.) Please write a map reduce program (one mapper and one reducer) to list the top-100 pairs of similar users with similarity (from most similarity to least similarity) Each line should have a pair of users' ID and the similarity. A valid example is as follows: "2" "3" 0.75
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
