Question: Using R programming, answer the questions below.

For the data file, look under "Data", download the movies_merged file, and look at pr1.Rmd (https://piazza.com/gatech/spring2017/cse6242/resources).

To unzip the file, first download the movies_merged file.

Answer the following:

## 1. Remove non-movie rows

The variable `Type` captures whether the row is a movie, a TV series, or a game. Remove all rows from `df` that do not correspond to movies.

# TODO: Remove all rows from df that do not correspond to movies
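
A minimal sketch of the filter, run on a small made-up data frame (`toy`) rather than the real `df`. It assumes movie rows are labelled exactly `"movie"` in `Type`; checking `table(df$Type)` on the real data first would confirm or correct that assumption.

```{r}
# Hypothetical stand-in for df; the real data comes from movies_merged.
toy <- data.frame(
  Title = c("A", "B", "C", "D"),
  Type  = c("movie", "series", "game", "movie"),
  stringsAsFactors = FALSE
)

# Keep only rows whose Type is "movie" (assumed label).
# Rows with a missing Type are dropped as well.
toy_movies <- toy[!is.na(toy$Type) & toy$Type == "movie", ]
nrow(toy_movies)
```

The same subsetting expression should carry over to `df` once the label is confirmed.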

## 2. Process `Runtime` column

The variable `Runtime` represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace `df$Runtime` with the new numeric column.
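
One possible conversion is sketched below. It assumes the strings look like `"1 h 30 min"`, `"90 min"`, or `"N/A"`; those formats are an assumption, not something the task statement confirms, so inspect `unique(df$Runtime)` before trusting it. The helper names `extract_number` and `runtime_to_minutes` are hypothetical.

```{r}
# Return the first number captured by `pattern` in each string, or 0 when absent.
extract_number <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) as.numeric(h[2]) else 0, numeric(1))
}

# Convert runtime strings to minutes.
# Assumed formats: "1 h 30 min", "90 min", "2 h"; anything without digits becomes NA.
runtime_to_minutes <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""
  total <- 60 * extract_number(x, "([0-9]+) *h") +
    extract_number(x, "([0-9]+) *min")
  total[!grepl("[0-9]", x)] <- NA  # e.g. "N/A" carries no runtime
  total
}

runtime_to_minutes(c("1 h 30 min", "90 min", "2 h", "N/A"))
```

If the formats match, `df$Runtime <- runtime_to_minutes(df$Runtime)` would replace the column in place.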

Now investigate the distribution of `Runtime` values and how it changes over the years (variable `Year`, which you can bucket into decades) and in relation to the budget (variable `Budget`). Include any plots that illustrate your findings.

# TODO: Investigate the distribution of Runtime values and how it varies by Year and Budget
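
A rough sketch of the decade bucketing and two simple summaries, using made-up values in `toy`. It assumes `Runtime` is already numeric and that `Budget` is numeric with `NA` for missing values; the log scale for `Budget` is a choice made here, not a requirement of the task.

```{r}
# Hypothetical values; replace `toy` with the cleaned df.
toy <- data.frame(
  Year    = c(1994, 1999, 2003, 2008, 2011, 2015),
  Runtime = c(142, 136, 201, 152, 130, 120),
  Budget  = c(25e6, 63e6, 94e6, 185e6, NA, 150e6)
)

# Bucket years into decades: 1994 -> 1990, 2003 -> 2000, and so on.
toy$Decade <- floor(toy$Year / 10) * 10

# Median runtime per decade.
median_by_decade <- tapply(toy$Runtime, toy$Decade, median, na.rm = TRUE)

# Linear association between runtime and log-budget, ignoring missing budgets.
runtime_budget_cor <- cor(toy$Runtime, log10(toy$Budget), use = "complete.obs")

# Distribution of runtime within each decade.
boxplot(Runtime ~ Decade, data = toy, xlab = "Decade", ylab = "Runtime (minutes)")
```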

## 3. Encode `Genre` column

The column `Genre` represents a list of genres associated with the movie in a string format. Write code to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence, and add it to the dataframe as additional columns. Then remove the original `Genre` column.

For example, if there are a total of 3 genres: Drama, Comedy, and Action, a movie that is both Action and Comedy should be represented by a binary vector <0, 1, 1>. Note that you need to first compile a dictionary of all possible genres and then figure out which movie has which genres (you can use the R `tm` package to create the dictionary).

# TODO: Replace Genre with a collection of binary columns

Plot the relative proportions of movies having the top 10 most common genres.
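
The encoding and the proportion plot might be sketched in base R as follows, without the `tm` package. The sketch assumes genres are separated by commas and that `"N/A"` marks a missing value; both are assumptions about the data, and `toy` is made up.

```{r}
# Hypothetical stand-in for df.
toy <- data.frame(
  Title = c("A", "B", "C", "D"),
  Genre = c("Action, Comedy", "Drama", "Comedy, Drama", "N/A"),
  stringsAsFactors = FALSE
)

# Split each string into its genres and drop the assumed missing-value marker.
genre_lists <- lapply(strsplit(toy$Genre, ","), function(g) {
  setdiff(trimws(g), c("N/A", ""))
})

# Dictionary of all genres that occur anywhere in the column.
dictionary <- sort(unique(unlist(genre_lists)))

# One row per movie, one 0/1 column per genre.
indicators <- do.call(rbind, lapply(genre_lists, function(g) {
  as.integer(dictionary %in% g)
}))
colnames(indicators) <- dictionary

# Add the binary columns and drop the original Genre column.
toy_encoded <- cbind(toy[, names(toy) != "Genre", drop = FALSE], indicators)

# Share of movies carrying each genre, most common first (at most 10).
top10 <- head(sort(colMeans(indicators), decreasing = TRUE), 10)
barplot(top10, las = 2, ylab = "Proportion of movies")
```

Because a movie can carry several genres, these proportions need not sum to 1.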

## 4. Eliminate mismatched rows

The dataframe was put together by merging two different sources of data, and it is possible that the merging process was inaccurate in some cases (the merge was done based on movie title, but there are cases of different movies with the same title). The first source's release time was represented by the column `Year` (numeric representation of the year) and the second's by the column `Released` (string representation of the release date).

Find and remove all rows where you suspect a merge error occurred based on a mismatch between these two variables. To make sure subsequent analysis and modeling work well, avoid removing more than 10% of the rows that have a `Gross` value present.

```{r}
# TODO: Remove rows with Released-Year mismatch
```
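
One candidate removal rule, sketched on made-up data: flag a row when both years are known and differ by more than one year. The `"YYYY-MM-DD"` format for `Released` and the one-year tolerance are assumptions made here; the tolerance in particular should be tuned so that the share of removed rows with a `Gross` value stays under the 10% limit.

```{r}
# Hypothetical stand-in for df.
toy <- data.frame(
  Year     = c(2000, 2001, 1995, 2010, 1980),
  Released = c("2000-05-12", "2002-01-15", "2005-03-01", NA, "1980-12-25"),
  Gross    = c(1e6, 3e6, NA, 2e6, NA),
  stringsAsFactors = FALSE
)

# Year part of the release date (assumes the string starts with YYYY).
released_year <- as.numeric(substr(toy$Released, 1, 4))

# Mismatch: both years known and more than 1 year apart (assumed tolerance).
mismatch <- !is.na(released_year) & !is.na(toy$Year) &
  abs(released_year - toy$Year) > 1

# Share of rows with a Gross value that the rule would remove.
gross_removed_share <- sum(mismatch & !is.na(toy$Gross)) / sum(!is.na(toy$Gross))

toy_clean <- toy[!mismatch, ]
c(removed = sum(mismatch), gross_removed_share = gross_removed_share)
```

Rows with a missing `Released` value are kept, since there is nothing to compare against.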

**Q**: What is your precise removal logic and how many rows did you end up removing?

**A**:

## 5. Explore `Gross` revenue

For the commercial success of a movie, production houses want to maximize Gross revenue. Investigate if Gross revenue is related to Budget, Runtime or Genre in any way.

Note: To get a meaningful relationship, you may have to partition the movies into subsets such as short vs. long duration, or by genre, etc.

```{r}
# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre
```
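
A small starting point for this exploration, on made-up numbers. The log scale and the 100-minute cut between "short" and "long" movies are arbitrary choices made for illustration, not values given by the assignment.

```{r}
# Hypothetical values; replace `toy` with the cleaned df.
toy <- data.frame(
  Budget  = c(1e6, 5e6, 2e7, 1e8, 5e7, 2e6),
  Gross   = c(2e6, 8e6, 6e7, 3e8, 9e7, 1e6),
  Runtime = c(85, 95, 120, 150, 110, 80)
)

# Correlation on the log scale, since both variables span orders of magnitude.
budget_gross_cor <- cor(log10(toy$Budget), log10(toy$Gross), use = "complete.obs")

# Partition into short vs. long movies (assumed threshold: 100 minutes).
toy$LengthGroup <- ifelse(toy$Runtime >= 100, "long", "short")
median_gross <- tapply(toy$Gross, toy$LengthGroup, median, na.rm = TRUE)

# Budget against Gross on log-log axes.
plot(toy$Budget, toy$Gross, log = "xy", xlab = "Budget", ylab = "Gross")
```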

**Q**: Did you find any observable relationships or combinations of Budget/Runtime/Genre that result in high Gross revenue? If you divided the movies into different subsets, you may get different answers for them - point out interesting ones.

**A**:

```{r}
# TODO: Investigate if Gross Revenue is related to Release Month
```

## 6. Process `Awards` column

The variable `Awards` describes nominations and awards in text format. Convert it to 2 numeric columns, the first capturing the number of wins, and the second capturing nominations. Replace the `Awards` column with these new columns, and then study the relationship of `Gross` revenue with respect to them.

Note that the format of the `Awards` column is not standard; you may have to use regular expressions to find the relevant values. Try your best to process them, and you may leave the ones that don't have enough information as NAs or set them to 0s.

```{r}
# TODO: Convert Awards to 2 numeric columns: wins and nominations
```
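
A sketch of one regular-expression approach. It assumes phrasings such as `"Won 2 Oscars. Another 10 wins & 5 nominations."` or `"Nominated for 1 Golden Globe."`; these example strings, the helper `sum_matches`, and the column names `Wins` and `Nominations` are hypothetical, so compare them against the real `Awards` values before reuse. Unparseable entries are set to 0 here, which is one of the two options the task allows.

```{r}
# Sum every number found in the matches of `pattern` within each string (0 if none).
sum_matches <- function(text, pattern) {
  text <- as.character(text)
  text[is.na(text)] <- ""
  vapply(text, function(s) {
    hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
    if (length(hits) == 0) return(0)
    sum(as.numeric(gsub("[^0-9]", "", hits)))
  }, numeric(1), USE.NAMES = FALSE)
}

# Hypothetical examples of the assumed text formats.
toy <- data.frame(
  Awards = c("Won 2 Oscars. Another 10 wins & 5 nominations.",
             "Nominated for 1 Golden Globe. Another 2 nominations.",
             "3 wins.",
             "N/A"),
  stringsAsFactors = FALSE
)

# Wins: numbers after "Won" plus numbers before "win(s)".
toy$Wins <- sum_matches(toy$Awards, "Won \\d+|\\d+ wins?")
# Nominations: numbers after "Nominated for" plus numbers before "nomination(s)".
toy$Nominations <- sum_matches(toy$Awards, "Nominated for \\d+|\\d+ nominations?")
toy$Awards <- NULL

# Rows with at least one win or nomination.
sum(toy$Wins > 0 | toy$Nominations > 0)
```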

**Q**: How did you construct your conversion mechanism? How many rows had valid/non-zero wins or nominations?

**A**:

```{r}
# TODO: Plot Gross revenue against wins and nominations
```

**Q**: How does the gross revenue vary by number of awards won and nominations received?

**A**:

## 7. Movie ratings from IMDb and Rotten Tomatoes

There are several variables that describe ratings, including IMDb ratings (`imdbRating` represents average user ratings and `imdbVotes` represents the number of user ratings), and multiple Rotten Tomatoes ratings (represented by several variables prefixed by `tomato`). Read up on such ratings on the web (for example [rottentomatoes.com/about](https://www.rottentomatoes.com/about) and [www.imdb.com/help/show_leaf?votestopfaq](http://www.imdb.com/help/show_leaf?votestopfaq)).

Investigate the pairwise relationships between these different descriptors using graphs.

```{r}
# TODO: Illustrate how ratings from IMDb and Rotten Tomatoes are related
```
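
A possible first pass: pick the rating columns by name, compute pairwise correlations, and draw a scatterplot matrix. The values in `toy` are made up, and the specific `tomato*` column names shown are placeholders; the code only relies on the `tomato` prefix mentioned in the task.

```{r}
# Hypothetical values and tomato* column names.
toy <- data.frame(
  imdbRating       = c(7.8, 6.1, 8.4, 5.2, 7.0),
  imdbVotes        = c(120000, 8000, 450000, 1500, 60000),
  tomatoMeter      = c(88, 45, 95, 20, NA),
  tomatoUserRating = c(4.0, 3.1, 4.4, 2.5, 3.6)
)

# IMDb columns plus every column whose name starts with "tomato".
rating_cols <- c("imdbRating", "imdbVotes",
                 grep("^tomato", names(toy), value = TRUE))
# Keep numeric columns only; text-valued tomato* columns cannot be correlated.
rating_cols <- rating_cols[vapply(toy[rating_cols], is.numeric, logical(1))]

# Pairwise correlations, using whatever rows are complete for each pair.
cors <- cor(toy[rating_cols], use = "pairwise.complete.obs")

# Scatterplot matrix of all rating variables.
pairs(toy[rating_cols])
```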

**Q**: Comment on the similarities and differences between the user ratings of IMDb and the critics' ratings of Rotten Tomatoes.

**A**:

## 8. Ratings and awards

These ratings typically reflect the general appeal of the movie to the public or gather opinions from a larger body of critics, whereas awards are given by professional societies that may evaluate a movie on specific attributes, such as artistic performance, screenplay, sound design, etc.

Study the relationship between ratings and awards using graphs (awards here refers to wins and/or nominations).

```{r}
# TODO: Show how ratings and awards are related
```
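
As a hedged starting point, a rank correlation can summarize how well a rating orders movies by award count. Spearman's coefficient is a choice made here because award counts tend to be heavily skewed. The column names `Wins` and `Nominations` are hypothetical (they depend on how `Awards` was converted), and the values are made up.

```{r}
# Hypothetical values; Wins and Nominations are assumed column names.
toy <- data.frame(
  imdbRating  = c(8.4, 7.8, 7.0, 6.1, 5.2, 6.5),
  Wins        = c(12, 5, 1, 0, 0, 2),
  Nominations = c(20, 9, 3, 1, 0, 1)
)

# Total recognition: wins plus nominations.
toy$Awards <- toy$Wins + toy$Nominations

# Rank correlation between rating and total awards.
rho <- cor(toy$imdbRating, toy$Awards, method = "spearman")

# Rating against total awards.
plot(toy$imdbRating, toy$Awards, xlab = "IMDb rating", ylab = "Wins + nominations")
```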

**Q**: How good are these ratings in terms of predicting the success of a movie in winning awards or nominations? Is there a high correlation between the two variables?

**A**:

## 9. Expected insights

Come up with two new insights (backed up by data and graphs) that are expected. Here, "new" means insights that are not an immediate consequence of one of the above tasks. You may use any of the columns already explored above or a different one in the dataset, such as `Title`, `Actors`, etc.

```{r}
# TODO: Find and illustrate two expected insights
```

**Q**: Expected insight #1.

**A**:

**Q**: Expected insight #2.

**A**:

## 10. Unexpected insight

Come up with one new insight (backed up by data and graphs) that is unexpected at first glance and do your best to motivate it. The same instructions apply as for the previous task.

```{r}
# TODO: Find and illustrate one unexpected insight
```

**Q**: Unexpected insight.

**A**:
