TDM 10100: Project 7 — 2022
Motivation: A couple of bread-and-butter functions that are a part of the base R are: subset
, and merge
. subset
provides a more natural way to filter and select data from a data.frame. merge
brings the principals of combining data that SQL uses, to R.
Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!
Scope: r, subset, merge, tapply
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/movies_and_tv/titles.csv
-
/anvil/projects/tdm/data/movies_and_tv/episodes.csv
-
/anvil/projects/tdm/data/movies_and_tv/people.csv
-
/anvil/projects/tdm/data/movies_and_tv/ratings.csv
Questions
Question 1
Please select 6000 memory when launching Jupyter for this project. |
Data can come in a lot of different formats and from a lot of different locations. It is not uncommon to have one or more files that need to be combined together before analysis is performed. merge
is a popular function in most data wrangling libraries. It is extremely similar and essentially equivalent to a JOIN
in SQL.
Read in each of the datasets into data.frames called: titles
, episodes
, people
, and ratings
.
Read the data in using the following code.
|
-
What are all the different listed genres (in the
titles
table)? -
Look at the
years
column and thegenres
column. In which year did the most comedies debut?
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Use the episode_title_id
column and the title_id
column from the episodes
and titles
data.frame’s (respectively) to merge the two data.frames.
Ultimately, we want to end up with a new data.frame that contains the primary_title
for every episodes in the episodes
table. Use the merge
function to accomplish this.
The |
This is also a really great explanation of merge in |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Use merge
(a few times) to create a new data.frame that contains at least the following information for only the show called "Friends". "Friends" (the show itself) has a title_id
of tt0108778. Each episode of Friends, has its own title_id
which contains the information for the specific episode as well.
-
The
primary_title
of the episode — call itepisode_title
. -
The
primary_title
of the show itself — call itshow_title
. -
The
rating
of the show itself — call itshow_rating
. -
The
rating
of the episode — call itepisode_rating
.
Start by getting a subset of the |
Show the top 5 rows of your final data.frame that contain the top 5 rated episodes.
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Use regular old indexing to find all episodes of friends with an episode_rating
greater than 9 and season_number
of exactly 5.
Repeat the process, but this time use the subset
function instead.
-
Code used to solve this problem.
-
Output from running the code.
Question 5
subset
is a sometimes useful function that allows you to index data.frame’s in a less verbose manner. Read this.
While it maybe appears to be a clean way to subset data, I’d suggest avoiding it over explicit long-form indexing. Read this fantastic article by Dr. Hadley Wickham on non-standard evaluation. Take for example, the following (a bit contrived) example using the dataframe we got in question (3).
season_number = 6
results[results$episode_rating > 9 & results$season_number == season_number,]
subset(results, episode_rating > 9 & season_number == season_number)
Read that provided article and do your best to explain why subset
gets a different result than our example that uses regular indexing.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |