STAT 39000: Project 1 — Fall 2020

Motivation: In this project we will jump right into an R review by breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.

Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we’ve previously learned.

Scope: data wrangling in R, functions

Learning objectives
  • Comprehend what a function is, and the components of a function in R.

  • Read and write basic (csv) data.

  • Utilize apply functions in order to solve a data-driven problem.

Make sure to read about and use the template found here, as well as the important information about project submissions here.

You can find useful examples that walk you through relevant material in The Examples Book.

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

It is highly recommended that you use rstudio.scholar.rcac.purdue.edu/. Simply click on the link and log in using your Purdue account credentials.

We decided to move away from ThinLinc and away from the version of RStudio used last year (desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.

Remember the very useful documentation shortcut ?. To use it, simply type ? in the console, followed by the name of the function you are interested in.

You can also look for package documentation by using help(package=PACKAGENAME), so for example, to see the documentation for the package ggplot2, we could run:

help(package=ggplot2)

Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R, C, or C++ code that is used to create the function. To see the source code of a defined function, type the function’s name without the (). For example, if we were curious about what the function Reduce does, we could run:

Reduce

Occasionally this will be less useful, because the printed code simply calls C code that we can’t see. Other times it will allow you to understand the function better.
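As a small illustration of the points above, the sketch below defines a hypothetical function, add_one (it is not part of the project), prints its source by typing its name, and inspects its components with formals and body. It also looks at sum, one of the base functions whose printed source only points at internal code.

```r
# add_one is a hypothetical toy function used only for illustration
add_one <- function(x, by = 1) {
  x + by
}

# Typing the name without () prints the source code
add_one

# The components of a function: its arguments and its body
formals(add_one)   # the argument list, including the default value for `by`
body(add_one)      # the code that runs when the function is called

# Some base functions are implemented internally, so printing them shows little
sum
is.primitive(sum)
```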

Dataset:

/class/datamine/data/airbnb

Oftentimes (maybe even the majority of the time), data doesn’t come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb.

Questions

Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs are subject to losing points.

Question 1

You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz, listings.csv.gz, and reviews.csv.gz (for now, we will ignore all files in the "visualisations" folders).

Let’s take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz, listings.csv.gz, and reviews.csv.gz). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
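To illustrate what joining on a shared variable looks like, here is a minimal sketch using two made-up data.frames. The names rooms, bookings, and room_id are hypothetical and are not taken from the Airbnb files; identifying the real key is part of the question.

```r
# Two toy data.frames that share a key column (hypothetical names)
rooms <- data.frame(room_id = c(1, 2, 3),
                    city = c("A", "B", "C"),
                    stringsAsFactors = FALSE)
bookings <- data.frame(room_id = c(1, 1, 3),
                       nights = c(2, 5, 1))

# merge keeps the rows whose key value appears in both data.frames
joined <- merge(rooms, bookings, by = "room_id")
joined
```

If the key has a different name in each data.frame, merge accepts by.x and by.y in place of by.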

read.csv has an argument to select the number of rows we want to read.

Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown, so you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.

To read a compressed csv, simply use the read.csv function:

dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
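The two hints above can be combined. The following sketch is self-contained: it writes a small gzipped CSV to a temporary file (toy data, not the Airbnb files) and then reads only part of it back with the nrows argument of read.csv.

```r
# Build a small gzipped CSV in a temporary location (toy data)
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
write.csv(data.frame(id = 1:100, flag = rep(c("t", "f"), 50)),
          con, row.names = FALSE)
close(con)

# read.csv reads the compressed file directly; nrows limits the rows read
first_rows <- read.csv(tmp, nrows = 50)
dim(first_rows)
```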

Let’s work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz datasets.

Items to submit
  • Chunk of code used to read the first 50 rows of each dataset.

  • 1-2 sentences briefly describing the information contained in each dataset.

  • Name(s) of variable(s) that could be used to join them.

Question 2

Write a function called get_paths_for_country that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz files, starting with /class/datamine/data/airbnb/….

For example, the output from get_paths_for_country("united-states") should have 28 entries. Here are the first 5 entries in the output:

 [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
 [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
 [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
 [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
 [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"

list.files is useful with the recursive=TRUE option.

Use grep to search for the pattern listings.csv.gz (within the results from the first hint), and use the option value=TRUE to display the values found by the grep function.
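Here is a minimal sketch of how these two hints fit together, run against a toy directory tree created under a temporary folder rather than the real /class/datamine/data/airbnb directory. The folder names are made up, and wrapping the idea in a function is left to you.

```r
# Create a toy directory tree (hypothetical folder names)
root <- tempfile("toy-airbnb-")
dirs <- file.path(root, "country-a", c("city-1", "city-2"), "2019-01-01", "data")
for (d in dirs) {
  dir.create(d, recursive = TRUE)
  file.create(file.path(d, c("listings.csv.gz", "calendar.csv.gz")))
}

# List every file below the country folder, with full paths
all_files <- list.files(file.path(root, "country-a"),
                        recursive = TRUE, full.names = TRUE)

# Keep only the paths that contain listings.csv.gz;
# fixed = TRUE treats the dots literally instead of as regex wildcards
listing_paths <- grep("listings.csv.gz", all_files, value = TRUE, fixed = TRUE)
listing_paths
```

Without full.names = TRUE, list.files returns paths relative to the folder that was searched, so the leading directory would have to be pasted back on.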

Items to submit
  • Chunk of code for your get_paths_for_country function.

Question 3

Write a function called get_data_for_country that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.

Use stringsAsFactors=FALSE in the read.csv function.

Use do.call(rbind, <listofdataframes>) to combine a list of dataframes into a single dataframe.
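As a sketch of the do.call(rbind, ...) hint, the example below builds a toy list of data.frames with lapply and stacks them into one. In the project, the list would presumably come from reading each path, but that part is an assumption and is not shown here.

```r
# Toy list of data.frames standing in for the per-file listings data
pieces <- lapply(c("a", "b", "c"), function(label) {
  data.frame(label = label, value = 1:2, stringsAsFactors = FALSE)
})

# do.call passes every element of the list to rbind as a separate argument,
# which is the same as rbind(pieces[[1]], pieces[[2]], pieces[[3]])
combined <- do.call(rbind, pieces)
dim(combined)
```

Note that rbind expects the data.frames to have the same column names.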

Items to submit
  • Chunk of code for your get_data_for_country function.

Question 4

Use your get_data_for_country function to get the data for a country of your choice, and make sure to name the data.frame listings. Take a look at the following columns: host_is_superhost, host_has_profile_pic, host_identity_verified, and is_location_exact. What is the data type for each column? (You can use class or typeof or str to see the data type.)
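For reference, this is roughly how class, typeof, and str behave on a small made-up data.frame (the names toy, flag, and n are hypothetical). Using sapply applies class to every column at once.

```r
# Toy data.frame with one character column and one integer column
toy <- data.frame(flag = c("t", "f", ""), n = 1:3, stringsAsFactors = FALSE)

class(toy$flag)
typeof(toy$n)
str(toy)

# Apply class to every column in one call
column_types <- sapply(toy, class)
column_types
```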

These columns would make more sense as logical values (TRUE/FALSE/NA).

Write a function called transform_column that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""), and we need to be careful when transforming the data. Test your function on column host_is_superhost.
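One possible way to think about this kind of recoding is sketched below on a different toy example: a vector of "yes"/"no" answers in which blanks stand for missing values. The vector and its values are made up, and adapting the idea to "t"/"f" inside a function is left as the exercise.

```r
# Toy character vector; "" marks a missing answer
answers <- c("yes", "no", "", "yes")

# Start from all NA (logical), then fill in only the values we recognize,
# so anything unexpected stays NA rather than silently becoming FALSE
converted <- rep(NA, length(answers))
converted[answers == "yes"] <- TRUE
converted[answers == "no"] <- FALSE

converted
class(converted)
```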

Items to submit
  • Chunk of code for your transform_column function.

  • Type of transform_column(listings$host_is_superhost).

Question 5

Create a histogram of response rates (host_response_rate) for super hosts (where host_is_superhost is TRUE). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert host_response_rate from a character variable containing "%" signs to a numeric variable.
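The conversion step can be sketched on a toy vector as follows. The values are made up, and the blank entry is only an assumption about how a missing rate might look; check what the real column contains before relying on this.

```r
# Toy response rates stored as text; "" stands in for a missing value
raw_rates <- c("95%", "100%", "", "80%")

# Remove the literal "%" sign, then convert the remaining text to numbers
rates <- as.numeric(sub("%", "", raw_rates, fixed = TRUE))
rates

# hist ignores NA values when drawing the histogram
hist(rates, main = "Toy response rates", xlab = "Response rate (%)")
```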

Items to submit
  • Chunk of code used to answer the question.

  • Histogram of response rates for super hosts.