STAT 39000: Project 10 — Spring 2021
Motivation: The use of a suite of packages referred to as the tidyverse is popular with many R users. It is apparent just by looking at tidyverse R code that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed — you may even find that you enjoy using them!
Context: We’ve covered a lot of ground so far this semester, almost entirely using Python. In this next series of projects we are going to switch back to R, with a strong focus on the tidyverse (including ggplot) and data wrangling tasks.
Scope: R, tidyverse, ggplot
Make sure to read about and use the template found here, as well as the important information about project submissions here.
The tidyverse consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and lubridate.
One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s excellent book, R for Data Science.
There is an excellent graphic here that illustrates a general workflow for data science projects:
- Import
- Tidy
- Iterate on, to gain understanding:
  - Transform
  - Visualize
  - Model
- Communicate
This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/okcupid/filtered/*.csv
Questions
Question 1
Let’s (more or less) follow the guidelines given above. The first step is to import the data. There are two files: questions.csv and users.csv. Read this section, and use what you learn to read the two files into questions and users, respectively. Which functions from the tidyverse did you use, and why?
Hint: It’s easy to load up the entire suite of packages with library(tidyverse).
Hint: Just because a file has the .csv extension does not guarantee that it is comma-separated; check the delimiter before deciding which read function to use.
Hint: Make sure to print all of the columns when you display the head of each dataset.
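Below is a minimal sketch of what reading the two files could look like, assuming the paths from the Dataset section above; the exact readr function (read_csv, read_csv2, read_delim, etc.) and any delimiter argument depend on how each file is actually delimited, so inspect the files first.

library(tidyverse)

# Sketch only: swap read_csv for read_delim(..., delim = ...) if a file
# turns out not to be comma-separated
questions <- read_csv("/class/datamine/data/okcupid/filtered/questions.csv")
users <- read_csv("/class/datamine/data/okcupid/filtered/users.csv")

head(questions)
head(users)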
- R code used to solve the problem.
- The head of each dataset, users and questions.
- 1 sentence explaining which functions you used (from the tidyverse) and why.
Question 2
You may recall that the function read.csv from base R reads data into a data.frame by default. In the tidyverse, the readr functions read data into a tibble instead. Read this section. To summarize, some important features that are true for tibbles but not necessarily for data.frames are (see the short comparison sketch after this list):
- Non-syntactic variable names (surrounded by backticks `` ` `` )
- Never changes the type of the inputs (for example, converting strings to factors)
- More informative output from printing
- No partial matching
- Simple subsetting
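To make the contrast concrete, here is a small, self-contained sketch using a hypothetical toy file (toy.csv is made up purely for illustration):

library(tidyverse)

# Write a tiny file with a non-syntactic column name ("my col")
write.csv(data.frame(x = 1:3, `my col` = c("a", "b", "c"), check.names = FALSE),
          "toy.csv", row.names = FALSE)

df <- read.csv("toy.csv")    # base R: a data.frame; the column name is changed to my.col
tbl <- read_csv("toy.csv")   # readr: a tibble; `my col` is kept as a non-syntactic name

class(df)    # "data.frame"
class(tbl)   # includes "tbl_df" -- it is a tibble
print(tbl)   # more informative printing: dimensions and column types, truncated output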
Great, the next step in our outline is to make the data "tidy". Read this section. Okay, let’s say, for instance, that we wanted to create a tibble with the following columns: user, question, question_text, selected_option, race, gender2, gender_orientation, n, and keywords. As you can imagine, the "tidy" format, while great for analysis, would not be great for storage, as there would be at least one row for each question for each user. Columns like gender2 and race don’t change for a user, so we end up with a lot of repeated values.
Okay, we don’t need to analyze all 68000 users at once. Let’s instead take a random sample of 2200 users and create a "tidy" tibble as described above. After all, we want to see why this format is useful! While trying to figure out how to do this may seem daunting at first, it is actually not so bad:
First, we convert the users tibble to long form, so each row represents 1 answer to 1 question from 1 user:
# Add an "id" column to the users data
users$id <- 1:nrow(users)

# To ensure we get the same random sample, run the set.seed line
# immediately before every run of the sampling code below
set.seed(12345)

columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200), ] %>%
    mutate_at(columns_to_pivot, as.character) %>% # convert every column in columns_to_pivot to character
    pivot_longer(cols = columns_to_pivot, names_to = "question", values_to = "selected_option") # the old qXXXX columns become values in the new "question" column
Next, we want to merge our data from the questions tibble with our users_sample_long tibble, into a new table we will call myDF. How many rows and columns are in myDF?
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")
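For reference only, the same result can be obtained in tidyverse style with dplyr’s inner_join (merge with its defaults performs an inner join); this assumes, as in the merge call above, that the question-identifier column in questions is named X:

# Equivalent tidyverse-style join: match "question" in users_sample_long to "X" in questions
myDF <- inner_join(users_sample_long, questions, by = c("question" = "X"))

dim(myDF)   # the number of rows and columns
head(myDF)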
- R code used to solve the problem.
- The number of rows and columns in myDF.
- The head of myDF.
Question 3
Excellent! Now we have a nice tidy dataset that we can work with. You may have noticed some odd syntax, %>%, in the code provided in the previous question. %>% is the piping operator in R, added by the magrittr package. It works pretty much just like | does in bash: it "feeds" the output of the previous bit of code to the next bit of code. It is extremely common practice to use this operator in the tidyverse.
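To make the piping idea concrete, here is a tiny, self-contained illustration using the built-in mtcars dataset (unrelated to the project data):

library(dplyr)

# Without the pipe: nested calls, read from the inside out
head(arrange(filter(mtcars, cyl == 6), desc(mpg)))

# With the pipe: each step "feeds" its result into the next, read top to bottom
mtcars %>%
    filter(cyl == 6) %>%   # keep only 6-cylinder cars
    arrange(desc(mpg)) %>% # sort by decreasing miles per gallon
    head()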
Observe the head of myDF. Notice how our question column has the value d_age, text has the content "Age", and selected_option (the column that shows the "answer" the user gave) has the actual age of the user. Wouldn’t it be better if myDF had a new column called age, instead of age being an answer to a question?
Modify the code provided in question (2) so that age ends up being a column in myDF, with its value being the actual age of the user.
Hint: You can make a single modification to 1 line to accomplish this. Pay close attention to the columns that are being pivoted.
- R code used to solve the problem.
- The number of rows and columns in myDF.
- The head of myDF.
Question 4
Wow! That is pretty powerful! Okay, it is clear that there are actual survey questions, where the column name starts with "q", and other variables, where the column name starts with something else. Modify your solution to question (3) so that all of the questions that don’t start with "q" have their own column in myDF. Like before, show the number of rows and columns of the new myDF, and print the head.
- R code used to solve the problem.
- The number of rows and columns in myDF.
- The head of myDF.
Question 5
It seems like we’ve spent the majority of the project just wrangling our dataset — that is normal! You’d be incredibly lucky to work in an environment where you receive data in a nice, neat, perfect format. Let’s do a couple of basic operations now, to practice.
mutate is a powerful function in dplyr that is not easy to mimic in Python’s pandas package. mutate adds new columns to your tibble while preserving your existing columns. It doesn’t sound very powerful, but it is.
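As a quick illustration on made-up toy data (not the project dataset), mutate computes new columns from existing ones, and pairs naturally with case_when for recoding values into categories:

library(tidyverse)

toy <- tibble(name = c("a", "b", "c"), age = c(19, 33, 70))

toy %>%
    mutate(
        decade = 10 * (age %/% 10),   # new numeric column; the existing columns are kept
        stage = case_when(            # new categorical column based on age
            age <= 24 ~ "young",
            age <= 64 ~ "adult",
            TRUE ~ "older"
        )
    )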
Use mutate to create a new column called generation. generation should contain "Gen Z" for ages [0, 24], "Millennial" for ages [25, 40], "Gen X" for ages [41, 56], "Boomers II" for ages [57, 66], and "Older" for all other ages.
- R code used to solve the problem.
- The number of rows and columns in myDF.
- The head of myDF.
Question 6
Use ggplot to create a scatterplot showing d_age on the x-axis and lf_min_age on the y-axis. lf_min_age is the minimum age a user is okay dating. Color the points based on gender2. Add a proper title, and labels for the x and y axes. Use alpha=.6.
Hint: This plot may take quite a few minutes to create. Before creating a plot with the entire dataset, test your code on a small subset first.
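One hedged sketch of how that subset-first approach could look, assuming your myDF from question (4) contains d_age, lf_min_age, and gender2 as columns (adjust names and types to match your own data):

library(tidyverse)

# Build the plot on a small random subset first; plotting everything can take a while
small <- myDF %>% slice_sample(prop = 0.01)   # roughly 1% of the rows

# d_age and lf_min_age may need as.numeric() if they are stored as character in your myDF
ggplot(small, aes(x = as.numeric(d_age), y = as.numeric(lf_min_age), color = gender2)) +
    geom_point(alpha = .6) +
    labs(title = "Minimum age willing to date vs. user age",
         x = "User age (d_age)",
         y = "Minimum acceptable age (lf_min_age)")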
- R code used to solve the problem.
- Output from running your code.
- The plot produced.