Monday, February 29, 2016

Data Wrangling with dplyr and tidyr

This workshop covers the dplyr and tidyr packages in R.

They are designed to help you restructure, rearrange, aggregate, and transform your data.

Several good online tutorials cover these packages in more depth:

dplyr

dplyr + tidyr

Getting dplyr and tidyr installed and loaded

# Run once if the packages are not yet installed:
#install.packages(c("dplyr", "tidyr"))
library(dplyr)
library(tidyr)

Example of using dplyr with baseball data

baseball <- read.csv("http://kmaurer.github.io/documents/baseball.csv")
head(baseball)
##          id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb
## 1 ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA
## 2 forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA
## 3 mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA
## 4 startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA
## 5 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA
## 6 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA
##   hbp sh sf gidp
## 1  NA NA NA   NA
## 2  NA NA NA   NA
## 3  NA NA NA   NA
## 4  NA NA NA   NA
## 5  NA NA NA   NA
## 6  NA NA NA   NA

Leap into RStudio to build the script from scratch!

Your Turn!

  • Create a subset containing only the records from seasons after 2005
  • Create a new column for batting average (hits / at bats)
  • Calculate the total number of hits for every player over their entire career
  • For each team, find the record number of home runs hit by one player in a single season
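The workshop script itself is not reproduced here, so the following is only a sketch of how the four tasks above might map onto dplyr verbs. It runs on a small made-up data frame (`toy`, not real statistics) that borrows the column names `id`, `year`, `team`, `ab`, `h`, and `hr` from the baseball data.

```r
library(dplyr)

# Made-up rows that mimic a few baseball columns (illustrative, not real statistics)
toy <- data.frame(
  id   = c("a01", "a01", "b02", "b02", "c03"),
  year = c(2004, 2006, 2006, 2007, 2007),
  team = c("AAA", "AAA", "BBB", "AAA", "BBB"),
  ab   = c(400, 500, 300, 450, 200),
  h    = c(100, 150, 90, 135, 40),
  hr   = c(10, 25, 5, 30, 12)
)

# filter(): keep only the rows from seasons after 2005
recent <- filter(toy, year > 2005)

# mutate(): add a batting average column (hits / at bats)
with_avg <- mutate(toy, ba = h / ab)

# group_by() + summarise(): total career hits for each player
career_hits <- toy %>%
  group_by(id) %>%
  summarise(total_hits = sum(h))

# group_by() + summarise(): most home runs in one player-season, per team
team_hr_record <- toy %>%
  group_by(team) %>%
  summarise(record_hr = max(hr))
```

On the full baseball data some columns contain NA values and some rows may have zero at bats, so `sum()` and `max()` would likely need `na.rm = TRUE`, and the batting average may come out as `NaN` for those rows.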

Example of using tidyr with french fry data

data(french_fries, package="reshape2")
head(french_fries)
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3

Leap back over to RStudio to continue this example!
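The live-coded script is not included here. As a rough sketch of the kind of reshaping tidyr is meant for, the code below rebuilds the first four rows from the head() output above and converts them between wide and long layouts. It assumes tidyr 1.0.0 or later for `pivot_longer()` and `pivot_wider()`; older versions offer `gather()` and `spread()` for the same job. The names `ff`, `flavor`, and `rating` are chosen for illustration.

```r
library(tidyr)

# First four rows of french_fries, copied from the head() output above
ff <- data.frame(
  time      = c(1, 1, 1, 1),
  treatment = c(1, 1, 1, 1),
  subject   = c(3, 3, 10, 10),
  rep       = c(1, 2, 1, 2),
  potato    = c(2.9, 14.0, 11.0, 9.9),
  buttery   = c(0.0, 0.0, 6.4, 5.9),
  grassy    = c(0.0, 0.0, 0.0, 2.9),
  rancid    = c(0.0, 1.1, 0.0, 2.2),
  painty    = c(5.5, 0.0, 0.0, 0.0)
)

# Wide to long: one row per combination of time, treatment, subject, rep, and flavor
ff_long <- pivot_longer(ff, cols = potato:painty,
                        names_to = "flavor", values_to = "rating")

# Long to wide again, this time spreading the two replicates into columns
ff_reps <- pivot_wider(ff_long, names_from = rep, values_from = rating,
                       names_prefix = "rep_")
```

With the replicates side by side in `rep_1` and `rep_2`, comparing the two ratings of the same batch becomes a simple column operation.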

Visualization with ggplot2

ggplot2 is a visualization package for R that supports many plot types and structures.

It is built on the idea of "the grammar of graphics".

# Run once if the package is not yet installed:
#install.packages("ggplot2")
library(ggplot2)

Example of using ggplot2 with diamonds data

The diamonds data set is included with ggplot2 and becomes available once the package is loaded.

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

ggplot2 with diamonds data

ggplot() +
  geom_point(aes(x=carat,y=price,color=cut), data=diamonds)

Back to RStudio once again for exploration!
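As one possible starting point for that exploration, the sketch below varies the carat versus price scatterplot by adding transparency and one panel per cut level. The alpha value, the faceting, and the axis labels are illustrative choices rather than part of the workshop script.

```r
library(ggplot2)

# Same carat vs. price scatterplot, with transparency to reduce overplotting
# and one panel per cut level (both choices are illustrative)
p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ cut) +
  labs(x = "Carat", y = "Price (US dollars)", color = "Cut")

print(p)
```

Putting `data` and `aes()` in the `ggplot()` call instead of inside `geom_point()` lets any further layers added with `+` share the same mappings.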