TidyTuesday: My 4-Hour Challenge

Apr 3, 2018 00:00 · 1421 words · 7 minute read R TidyTuesday tidyverse R4DS

In the first quarter of 2018, I focused my data science education on expanding my R programming skills and setting up this blog in Hugo. I developed my first R package, an API wrapper for the U.S. National Provider Identifier (NPI) registry, and created an R-powered Power BI custom visual for a client.

Although I still plan to continue working on these projects, I’m ready to start a new challenge. In my consulting work, I do a fair amount of data wrangling in SQL and visualization in Power BI. I’m eager to build those same skills in R, particularly since I learned most of what I know about R in grad school, before the modern tidyverse era.

The Challenge

Enter the #TidyTuesday challenge, a weekly data wrangling and visualization challenge organized by the energetic and inspiring R for Data Science (R4DS) online learning community, which I joined a few weeks ago. The challenge is to remake a visualization based on a publicly available dataset that has been cleaned up but not tidied. For more information about this challenge, check out the TidyTuesday GitHub repo.

This week, I’m limiting myself to spending only 4 hours on the TidyTuesday challenge, including writing and publishing this post. As time allows, I’ll refine the product and post the follow-up result. I’m doing this partly because I have other important commitments (hello family and full-time job!), but also because it creates conditions that are closer to real life.

Here are the two visualizations in this week’s challenge (original source):

Data Wrangling

Let’s load the packages we’ll need, as described in the code comments below.

library(tidyverse) # It's TidyTuesday after all!
## ── Attaching packages ────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.1     ✔ stringr 1.4.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'stringr' was built under R version 3.5.2
## ── Conflicts ───────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readxl)    # To read in Excel data the tidyverse way
library(viridis)   # For awesome, accessible color palettes
## Loading required package: viridisLite

Let’s read in the source data, which in this case is an Excel (.xlsx) file hosted on the TidyTuesday GitHub repo:

file_path <- "data/us_avg_tuition.xlsx"
r <- read_excel(file_path)
r
## # A tibble: 50 x 13
##    State       `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10`
##    <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Alabama         5683.     5841.     5753.     6008.     6475.     7189.
##  2 Alaska          4328.     4633.     4919.     5070.     5075.     5455.
##  3 Arizona         5138.     5416.     5481.     5682.     6058.     7263.
##  4 Arkansas        5772.     6082.     6232.     6415.     6417.     6627.
##  5 California      5286.     5528.     5335.     5672.     5898.     7259.
##  6 Colorado        4704.     5407.     5596.     6227.     6284.     6948.
##  7 Connecticut     7984.     8249.     8368.     8678.     8721.     9371.
##  8 Delaware        8353.     8611.     8682.     8946.     8995.     9987.
##  9 Florida         3848.     3924.     3888.     3879.     4150.     4783.
## 10 Georgia         4298.     4492.     4584.     4790.     4831.     5550.
## # ... with 40 more rows, and 6 more variables: `2010-11` <dbl>,
## #   `2011-12` <dbl>, `2012-13` <dbl>, `2013-14` <dbl>, `2014-15` <dbl>,
## #   `2015-16` <dbl>
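
Incidentally, since the file is hosted on GitHub, you could also skip the local copy and download it at run time. Here’s a minimal, untested sketch: the raw URL below is my guess at the file’s location in the repo (it may have moved since), and because read_excel() can’t read directly from a URL, we save to a temporary file first.

tmp <- tempfile(fileext = ".xlsx")
# Assumed raw URL -- double-check against the TidyTuesday repo before running
download.file(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/us_avg_tuition.xlsx",
  destfile = tmp,
  mode = "wb"  # binary mode so the .xlsx isn't mangled on Windows
)
r <- read_excel(tmp)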

It seems we have a mostly clean but untidy dataset. I’ll first do some light cleaning: rename State to state to follow the tidyverse style guide, then add a state_abb column containing the two-letter U.S. state codes from the built-in datasets package, stored as a factor. The two-letter codes will be more space-efficient than the full state names.

r <- r %>% 
  rename(state = State) %>% 
  # state.abb is in alphabetical order, matching the row order of this data
  mutate(state_abb = factor(state.abb))
r
## # A tibble: 50 x 14
##    state       `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10`
##    <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Alabama         5683.     5841.     5753.     6008.     6475.     7189.
##  2 Alaska          4328.     4633.     4919.     5070.     5075.     5455.
##  3 Arizona         5138.     5416.     5481.     5682.     6058.     7263.
##  4 Arkansas        5772.     6082.     6232.     6415.     6417.     6627.
##  5 California      5286.     5528.     5335.     5672.     5898.     7259.
##  6 Colorado        4704.     5407.     5596.     6227.     6284.     6948.
##  7 Connecticut     7984.     8249.     8368.     8678.     8721.     9371.
##  8 Delaware        8353.     8611.     8682.     8946.     8995.     9987.
##  9 Florida         3848.     3924.     3888.     3879.     4150.     4783.
## 10 Georgia         4298.     4492.     4584.     4790.     4831.     5550.
## # ... with 40 more rows, and 7 more variables: `2010-11` <dbl>,
## #   `2011-12` <dbl>, `2012-13` <dbl>, `2013-14` <dbl>, `2014-15` <dbl>,
## #   `2015-16` <dbl>, state_abb <fct>

Now for the tidying. In this dataset, each value is an observation of the “average in-state tuition and fees for one year of full-time study at public institutions” for a given school year. But values belonging to different school years are spread out across columns, which violates tidy principles. We need to gather() these values and index them by year:

r %>% 
  gather(year, tuitfee, `2004-05`:`2015-16`)
## # A tibble: 600 x 4
##    state       state_abb year    tuitfee
##    <chr>       <fct>     <chr>     <dbl>
##  1 Alabama     AL        2004-05   5683.
##  2 Alaska      AK        2004-05   4328.
##  3 Arizona     AZ        2004-05   5138.
##  4 Arkansas    AR        2004-05   5772.
##  5 California  CA        2004-05   5286.
##  6 Colorado    CO        2004-05   4704.
##  7 Connecticut CT        2004-05   7984.
##  8 Delaware    DE        2004-05   8353.
##  9 Florida     FL        2004-05   3848.
## 10 Georgia     GA        2004-05   4298.
## # ... with 590 more rows

Ah, but wait! For the visualizations, we’ll ultimately need to have the “five-year percentage change in inflation-adjusted tuition and fees” for 2015-16. That’ll be a little easier to calculate now while the data is in wide format.

# Calculate the most recent 5-year % change in tuition and fees
r <- r %>%
  mutate(tuitfee_5yr_pct_chg = (`2015-16` - `2010-11`) / `2010-11` * 100) %>% 
  gather(year, tuitfee, `2004-05`:`2015-16`) %>% 
  select(state_abb, year, tuitfee, tuitfee_5yr_pct_chg) %>% 
  filter(year == "2015-16")
r
## # A tibble: 50 x 4
##    state_abb year    tuitfee tuitfee_5yr_pct_chg
##    <fct>     <chr>     <dbl>               <dbl>
##  1 AL        2015-16   9751.                20.8
##  2 AK        2015-16   6571.                14.1
##  3 AZ        2015-16  10646.                20.4
##  4 AR        2015-16   7867.                14.0
##  5 CA        2015-16   9270.                13.1
##  6 CO        2015-16   9748.                25.8
##  7 CT        2015-16  11397.                16.0
##  8 DE        2015-16  11676.                10.8
##  9 FL        2015-16   6360.                15.4
## 10 GA        2015-16   8447.                31.4
## # ... with 40 more rows

Now that’s a tidy data frame!

Visualization

We’re ready to start visualizing. I’ll begin with the bar chart at the bottom of the example, since I don’t know as much about creating choropleth maps. I’m going to redo the bar chart as a Cleveland-style dot plot, which conveys the same information with less “ink” and simplifies the perceptual decoding task.

r %>% 
  ggplot() +
  geom_point(aes(color = tuitfee_5yr_pct_chg,
                 x = fct_reorder(state_abb, tuitfee),
                 y = tuitfee)) +
  ggtitle("Average tuition and fees in the United States",
          subtitle = "Average tuition and fees for one year of full-time study at public institutions, 2015-16") +
  xlab("State") +
  ylab("\nAverage Total Tuition and Fees") +
  labs(color = "5-Yr % Chg\n",
       caption = "Source: https://bit.ly/2EeBXpf") +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_viridis() +
  coord_flip() +
  theme_minimal()

Summary

Time is almost up, so here are some quick thoughts about what I’d do next if I had more time (rough sketches for a couple of these follow the list):

  • Modify the color scale to use a diverging scale centered at 0, so it’s more obvious when a state’s tuition has gone down versus up
  • Add a reference line or distinct geom to show the national average
  • Adjust the state axis so the state labels are more legible
  • Find a way to show more of the raw data – what if 2010-11 was an awkward year with special causes at play?
  • Learn how to create the map visualization
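
To make a couple of these concrete, here’s a rough, untested sketch of where I’d start on the first two items, assuming the tidied r data frame from above: scale_color_gradient2() gives a diverging scale centered at a midpoint of 0, and geom_hline() (which becomes a vertical line after coord_flip()) marks the simple, unweighted average across states. The specific colors are just placeholders.

national_avg <- mean(r$tuitfee)

r %>% 
  ggplot(aes(x = fct_reorder(state_abb, tuitfee), y = tuitfee)) +
  geom_hline(yintercept = national_avg, linetype = "dashed", color = "grey50") +
  geom_point(aes(color = tuitfee_5yr_pct_chg)) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_gradient2(low = "steelblue", mid = "grey85", high = "firebrick",
                        midpoint = 0, name = "5-Yr % Chg") +
  coord_flip() +
  theme_minimal()

And for the map, one possible starting point (assuming the maps package is installed) is to join the tidied data to ggplot2’s built-in state outlines and fill by tuition:

library(maps)  # provides the state outlines used by map_data()

map_df <- map_data("state") %>% 
  left_join(
    r %>% mutate(region = tolower(state.name[match(state_abb, state.abb)])),
    by = "region"
  )

ggplot(map_df, aes(long, lat, group = group, fill = tuitfee)) +
  geom_polygon(color = "white", size = 0.2) +
  coord_quickmap() +
  scale_fill_viridis(labels = scales::dollar, name = "Tuition & Fees") +
  theme_void()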

What I learned from this challenge:

  • Four hours seemed like a lot, but it isn’t when you have to look up a lot of things online and write up the process as you go
  • Writing out the coding steps and rendering the results in an R Markdown document was fun and helped me figure out what I needed to do next
  • I have a lot to learn, especially when it comes to tweaking ggplot2 figures. To me, this is the most exciting part.

I hope to return to these lists and update the visualization in a later post. In the meantime, if you have any thoughts you’d like to share, I’d love to hear from you in the comments below or on Twitter.

Many thanks to the R4DS community for setting up the first of what I hope will be many future #TidyTuesday challenges!
