---
title: "fivethirtyeight Package"
author: "Albert Y. Kim, Chester Ismay, and Jennifer Chunn"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: |
  %\VignetteIndexEntry{fivethirtyeight Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, message=FALSE, warning=FALSE, echo=FALSE}
library(fivethirtyeight)
library(ggplot2)
library(dplyr)
library(readr)
library(knitr)
library(tibble)

# Pull all dataset names
all_datasets <- datasets_master %>%
  pull(`Data Frame Name`) %>%
  unique()

# Pull all fivethirtyeightdata dataset names
all_fivethirtyeightdata_datasets <- datasets_master %>%
  filter(`In fivethirtyeightdata?` == "Y") %>%
  pull(`Data Frame Name`) %>%
  unique() %>%
  sort()

if (FALSE) {
  # Get data set names as listed in pkg
  pkg_data_list <- data(package = "fivethirtyeightdata")[["results"]] %>%
    as_tibble() %>%
    pull(Item) %>%
    sort()

  # This should yield TRUE
  identical(all_fivethirtyeightdata_datasets, pkg_data_list)
}

# Pull all fivethirtyeight dataset names
all_fivethirtyeight_datasets <- datasets_master %>%
  filter(is.na(`In fivethirtyeightdata?`)) %>%
  pull(`Data Frame Name`) %>%
  unique() %>%
  sort()

if (FALSE) {
  # Get data set names as listed in pkg
  pkg_data_list <- data(package = "fivethirtyeight")[["results"]] %>%
    as_tibble() %>%
    filter(Item != "datasets_master") %>%
    pull(Item) %>%
    sort()

  # This should yield TRUE
  identical(all_fivethirtyeight_datasets, pkg_data_list)
}
```

## Acknowledgment

We are aware of [this tweet](https://twitter.com/MonaChalabi/status/1270715883531300866){target="_blank"} by Mona Chalabi. Although we have not yet decided on the future of the `fivethirtyeight` package (and, consequently, the `fivethirtyeightdata` package), we reiterate that this package is not officially published by 538.

## Note on large datasets

There are `r all_fivethirtyeight_datasets %>% length()` datasets included in the `fivethirtyeight` package.
However, there are also `r all_fivethirtyeightdata_datasets %>% length()` datasets that could not be included in `fivethirtyeight` due to CRAN package size restrictions:

```{r, message=FALSE, warning=FALSE, echo=FALSE}
all_fivethirtyeightdata_datasets
```

These `r all_fivethirtyeightdata_datasets %>% length()` datasets are included in the `fivethirtyeightdata` add-on package^[The `fivethirtyeightdata` package is hosted via a [`drat` repository](https://CRAN.R-project.org/package=drat/vignettes/DratForPackageAuthors.html){target="_blank"}.], which you can install by running:

```{r, eval=FALSE}
install.packages('fivethirtyeightdata',
                 repos = 'https://fivethirtyeightdata.github.io/drat/',
                 type = 'source')
```

For example, to load the `senators` dataset, run:

```{r, eval = FALSE}
library(fivethirtyeight)
library(fivethirtyeightdata)
senators
```

## All datasets

All `r all_fivethirtyeight_datasets %>% length()` + `r all_fivethirtyeightdata_datasets %>% length()` = `r all_datasets %>% length()` datasets across the `fivethirtyeight` and `fivethirtyeightdata` packages are listed here.

```{r, message=FALSE, warning=FALSE, echo=FALSE}
datasets_master %>%
  mutate(
    # Wrap each data frame name in backticks so it renders as code
    `Data Frame Name` = paste("`", `Data Frame Name`, "`", sep = ""),
    # Show "Yes" for datasets that live in the add-on package, blank otherwise
    `In fivethirtyeightdata?` = ifelse(is.na(`In fivethirtyeightdata?`), "", "Yes")
  ) %>%
  kable()
```

## Motivation

The motivation for creating this package is articulated in [The fivethirtyeight R Package: "Tame Data" Principles for Introductory Statistics and Data Science Courses](https://escholarship.org/uc/item/0rx1231m) by Kim, Ismay, and Chunn (2018), published in Volume 11, Issue 1 of the journal "Technology Innovations in Statistics Education". Here is an executive summary.

We are involved in statistics and data science education, in particular at the introductory undergraduate level. As such, we are always looking for data sets that balance being:

1. **Rich enough** to answer meaningful questions with, **real enough** to ensure that there is context, and **realistic enough** to convey to students that data as it exists "in the wild" often needs processing.
1. Easily and quickly accessible to novices, so that we [minimize the prerequisites to research](https://arxiv.org/abs/1507.05346).

It has been our experience that many data sets that exist in R packages, such as the [`nycflights13`](https://github.com/tidyverse/nycflights13), [`babynames`](https://github.com/hadley/babynames), and [`gapminder`](https://github.com/jennybc/gapminder) packages, are of great pedagogical value as they:

* Satisfy the above two goals.
* Are in a standardized format, as they fit into the [tidy tools](https://CRAN.R-project.org/package=tidyverse/vignettes/manifesto.html) ecosystem.
* Are really fun to play with!

It is along these lines that we present `fivethirtyeight`: an R package of data and code behind the stories and interactives at [FiveThirtyEight.com](https://fivethirtyeight.com/), a data-driven journalism website founded by Nate Silver and owned by ESPN. FiveThirtyEight has been very forward-thinking in making the data used in many of its articles open and accessible on [GitHub](https://github.com/fivethirtyeight/data), a web-based repository for collaboration on code and data. With consultation from [Andrew Flowers](https://fivethirtyeight.com/contributors/andrew-flowers/) and [Andrei Scheinkman](https://fivethirtyeight.com/contributors/andrei-scheinkman/) of FiveThirtyEight, we go one step further by:

1. Doing just enough pre-processing (i.e. data "taming") so that statistics and data science novices can sink their teeth into the data right away.
2. Packaging it all in an easy-to-load format: package installation instead of working with CSV files.
3. Providing easily accessible documentation: the help file for each data set includes a thorough description of the observational unit and all variables, a link to the original article, and (if listed) the data sources.

## "Tame" data principles

In order to make the data easily accessible to R novices, we pre-process the original data sets as they exist in the [538 GitHub repository](https://github.com/fivethirtyeight/data) to adhere to the following "tame" data guidelines:

1. **Naming conventions for data frame and variable names**:
    1. Whenever possible, all names should be no more than 20 characters long. Exceptions to this rule exist when shortening the names to 20 characters or fewer would lead to a loss of information.
    1. Use only lower case characters and replace all spaces with underscores. This format is known as `snake_case` and is an alternative to `camelCase`, where successive words are delineated with upper case characters.
    1. In the case of variable (column) names within a data frame, use underscores instead of spaces.
1. **Variables identifying observational units**:
    1. Any variables uniquely identifying each observational unit should be in the left-hand columns.
1. **Dates**:
    1. If only a `year` variable exists, then it should be represented as a numerical variable.
    1. If there are `year` and `month` variables, then convert them to `Date` objects as `year-month-01`. In other words, assign all observations from the same month a `day` of `01` so that a correct `Date` object can be created.
    1. If there are `year`, `month`, and `day` variables, then convert them to `Date` objects as `year-month-day`.
1. **Ordered Factors, Factors, Characters, and Logicals**:
    1. Ordinal categorical variables are represented as `ordered` factors.
    1. Categorical variables with a fixed and known set of levels are represented as regular `factor`s.
    1. Categorical variables whose possible levels are either unknown or very numerous are represented as `character`s.
    1. Any "yes/no" character encoding of binary variables is converted to `TRUE/FALSE` logical variables.
1. **Tidy data format**:
    1. Whenever possible, save all data frames in "tidy" data format as defined by Hadley Wickham:
        a) Each variable forms a column.
        a) Each observation forms a row.
        a) Each type of observational unit forms a table.
    1. If converting the raw data to "tidy" data format alters the dataset too much, then make the code to convert to tidy format easily accessible.

**Note**: The code used to pre-process the data can be found on the [GitHub repository](https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw) for the package in the `process_data_sets.R` files. These can serve as data manipulation/wrangling examples and exercises for more advanced students.
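
As a rough illustration of several of the guidelines above, the following sketch "tames" a small made-up data frame using only base R. The data frame `raw`, its column names, and its values are invented for this example and do not come from any FiveThirtyEight dataset; the package's actual pre-processing code lives in the `process_data_sets.R` files and may be organized quite differently.

```r
# Hypothetical raw data, invented purely for illustration
raw <- data.frame(
  "Poll Rating"      = c("low", "high", "medium"),
  "Year"             = c(2015, 2015, 2016),
  "Month"            = c(1, 12, 7),
  "Registered Voter" = c("yes", "no", "yes"),
  "Respondent ID"    = c(101, 102, 103),
  check.names = FALSE,        # keep the spaces in the raw column names
  stringsAsFactors = FALSE
)

tame <- raw

# Naming: lower case only, spaces replaced with underscores (snake_case)
names(tame) <- gsub(" ", "_", tolower(names(tame)))

# Dates: year + month become a Date with the day fixed at 01
tame$date <- as.Date(
  sprintf("%04d-%02d-01", as.integer(tame$year), as.integer(tame$month))
)
tame$year <- NULL
tame$month <- NULL

# Ordinal categorical variable: represent as an ordered factor
tame$poll_rating <- factor(
  tame$poll_rating,
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# "yes"/"no" character encoding: convert to a logical
# (assumes the only values present are "yes" and "no")
tame$registered_voter <- tame$registered_voter == "yes"

# Identifying variable goes in the left-most column
tame <- tame[, c("respondent_id", "date", "poll_rating", "registered_voter")]

str(tame)
```

Base R is used here only to keep the sketch free of dependencies; the same steps could equally be written with `dplyr` verbs such as `rename()`, `mutate()`, and `select()`.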