DPI R Bootcamp

Jared Knowles

In this lesson we hope to learn:

- Aggregating data
- Organizing our data
- Manipulating vectors
- Dealing with missing data

```
# Set working directory to the tutorial directory; in RStudio you can
# do this from the 'Tools' menu
setwd("~/GitHub/r_tutorial_ed")
# Load some data
load("data/smalldata.rda")
# Note: if we don't assign the data to 'df', R just prints the contents
# of the table
```

- Sometimes we need to do some basic checking for the number of observations or types of observations in our dataset
- To do this quickly and easily, the `table` function is our friend
- Let's look at our observations by year and grade

`table(df$grade, df$year)`

```
##
## 2000 2001 2002
## 3 200 100 200
## 4 100 200 100
## 5 200 100 200
## 6 100 200 100
## 7 200 100 200
## 8 100 200 100
```

- The first argument gives the rows, the second gives the columns
- Ugly, but effective

- Let’s aggregate by race and year

`table(df$year, df$race)`

```
##
## A B H I W
## 2000 16 370 93 7 414
## 2001 16 370 93 7 414
## 2002 16 370 93 7 414
```

- Race is consistent across years, interesting
- What if we want to only look at 3rd graders that year?

```
with(df[df$grade == 3, ], {
table(year, race)
})
```

```
## race
## year A B H I W
## 2000 4 78 22 4 92
## 2001 1 44 8 2 45
## 2002 0 74 20 1 105
```

- `with` specifies a data object to work on, in this case all elements of `df` where `grade == 3`
- `table` is the same command as above, but since we specified the data object in the `with` statement, we don't need the `df$` in front of the variables of interest

- Can you find the number of black students in each grade in each year?
- Hint: `with(df[df$___ == ___, ]...)`

- How many in year 2002, grade 6?
- 48
- How many in 2001, grade 7?
- 39

```
with(df[df$race == "B", ], {
table(year, grade)
})
```

```
## grade
## year 3 4 5 6 7 8
## 2000 78 48 87 39 74 44
## 2001 44 78 48 87 39 74
## 2002 74 44 78 48 87 39
```

- This is really powerful for looking at the descriptive dimensions of the data; we can ask questions like:
- how many students are at each proficiency level each year?

`table(df$year, df$proflvl)`

```
##
## advanced basic below basic proficient
## 2000 56 313 143 388
## 2001 229 183 64 424
## 2002 503 27 3 367
```

- how many students are at each proficiency level by race?

`table(df$race, df$proflvl)`

```
##
## advanced basic below basic proficient
## A 19 7 3 19
## B 160 302 162 486
## H 54 76 33 116
## I 7 4 1 9
## W 548 134 11 549
```

- What if we aren’t interested in counts?
- R makes it really easy to calculate proportions

`prop.table(table(df$race, df$proflvl))`

```
##
## advanced basic below basic proficient
## A 0.0070370 0.0025926 0.0011111 0.0070370
## B 0.0592593 0.1118519 0.0600000 0.1800000
## H 0.0200000 0.0281481 0.0122222 0.0429630
## I 0.0025926 0.0014815 0.0003704 0.0033333
## W 0.2029630 0.0496296 0.0040741 0.2033333
```

- Hmmm, this is goofy. This tells us the proportion of each cell out of the total. Also, the digits are distracting. How can we fix this?

`round(prop.table(table(df$race, df$proflvl), 1), digits = 3)`

```
##
## advanced basic below basic proficient
## A 0.396 0.146 0.062 0.396
## B 0.144 0.272 0.146 0.438
## H 0.194 0.272 0.118 0.416
## I 0.333 0.190 0.048 0.429
## W 0.441 0.108 0.009 0.442
```

- The `1` tells R we want proportions row-wise; a `2` would go column-wise
- `round` tells R to cut off some digits for us
- Proportions are just that, not percentages (we need to multiply by 100 for percentage terms)
- Can you make this table express percentages instead of proportions? How might that code look?
- A few more problems arise–this pools all observations, including students across years
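One possible answer to the percentage question, sketched on a tiny made-up data frame (the column names mirror `df`, but the values here are hypothetical):

```
# Hypothetical stand-in data; the real df has the same columns
toy <- data.frame(race = c("A", "A", "B", "B", "B"),
                  proflvl = c("basic", "proficient", "basic",
                              "basic", "proficient"))
# Multiplying the row-wise proportions by 100 expresses them as percentages
round(prop.table(table(toy$race, toy$proflvl), 1) * 100, digits = 1)
```

Each row of the resulting table now sums to (roughly) 100.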

- One of the most common questions will be to compute aggregates of data
- R has an `aggregate` function that can be used and helps us avoid the clustering problems above
- This works great for simple aggregation like scale score by race; we just need a `formula` (think: I want variable X **by** grouping factor Y) and the statistic we want to compute

```
# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)
```

```
## race readSS
## 1 A 508.7
## 2 B 460.2
## 3 H 473.2
## 4 I 485.2
## 5 W 533.2
```

`aggregate` can take us a little further: we can aggregate multiple variables at a time

`aggregate(cbind(readSS, mathSS) ~ race, data = df, mean)`

```
## race readSS mathSS
## 1 A 508.7 477.9
## 2 B 460.2 442.5
## 3 H 473.2 442.7
## 4 I 485.2 455.9
## 5 W 533.2 529.8
```

- We can add multiple grouping variables using the `formula` syntax

```
head(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean),
8)
```

```
## race grade readSS mathSS
## 1 A 3 397.8 454.8
## 2 B 3 409.8 371.6
## 3 H 3 417.7 364.2
## 4 I 3 407.6 449.3
## 5 W 3 481.1 450.7
## 6 A 4 456.0 438.2
## 7 B 4 426.9 408.1
## 8 H 4 418.8 404.6
```

`head` or `tail` just tell R to present us the first (or last) **X** items in an object, in this case **8** rows

- We can build a systematic cross-tab now

```
ag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)
```

```
## grade
## race 3 4 5 6 7 8
## A 397.8 456.0 479.1 539.5 600.4 605.3
## B 409.8 426.9 447.6 470.9 492.3 523.5
## H 417.7 418.8 481.2 489.1 500.3 534.2
## I 407.6 531.1 547.6 0.0 405.5 518.0
## W 481.1 498.5 517.1 546.6 565.2 596.1
```

- And prettier output

`ftable(xtabs(readSS ~ ., data = ag))`

```
## grade 3 4 5 6 7 8
## race
## A 397.8 456.0 479.1 539.5 600.4 605.3
## B 409.8 426.9 447.6 470.9 492.3 523.5
## H 417.7 418.8 481.2 489.1 500.3 534.2
## I 407.6 531.1 547.6 0.0 405.5 518.0
## W 481.1 498.5 517.1 546.6 565.2 596.1
```

What is the mean reading score for 6th grade students with disabilities?

**481.83**

How many points is this from non-disabled students?

**29.877**

`aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)`

```
## disab grade readSS mathSS
## 1 0 3 449.9 418.3
## 2 1 3 421.1 376.3
## 3 0 4 464.0 454.2
## 4 1 4 438.2 425.1
## 5 0 5 484.9 470.2
## 6 1 5 475.1 431.0
## 7 0 6 511.7 507.9
## 8 1 6 481.8 476.9
## 9 0 7 532.0 532.0
## 10 1 7 516.1 474.3
## 11 0 8 567.6 567.7
## 12 1 8 518.8 534.1
```

- Consider the case we want to turn our student level data into school level data
- Who hasn’t had to do this?!?
- In `aggregate` we do:

```
z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z
```

```
## dist readSS
## 1 205 496.5
## 2 402 500.5
## 3 495 491.6
```

- But I want more! I want to aggregate multiple variables. I want to do it across multiple groups. I want the output to be a dataframe I can work on.
- Thank you, `plyr`!
- `aggregate` is cool, but it isn't very flexible
- We can only use aggregate output as a table, which we have to convert to a data frame
- There is a better way: the `plyr` package
- `plyr` is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type
- We will focus here on summarizing and aggregating a data frame, but later in the bootcamp we'll apply functions to lists and turn lists into data frames as well
- This is cool!

- In R this is known as "split, apply, and combine"
- Why? First, we **split** the data into groups by some identifier
- Then we **apply** some function or another to that group (i.e. count the unique values of a variable)
- Then we **combine** the data back together
- This has some advantages: unlike other methods, the data does not have to be ordered by our id variable for this to work
- The disadvantage is that this method is computationally expensive; even in R, it requires copying our data frame, using up RAM
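The three steps above can be sketched in base R itself, without `plyr`, on a small hypothetical data frame:

```
# Hypothetical example data
dat <- data.frame(sex = c("M", "F", "M", "F", "F"),
                  age = c(10, 11, 10, 10, 11))
pieces <- split(dat, dat$sex)    # split: one piece per group
counts <- lapply(pieces, nrow)   # apply: a function to each piece
do.call(rbind, counts)           # combine: back into one object
```

`plyr` wraps this pattern into a single call, but the mechanics are the same.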

- The `plyr` package has a number of utilities to help us split-apply-combine across data types for both input and output
- In R we can't just use `for` loops to iterate over groups of students, because in R `for` loops are slow, inefficient, and impractical
- `plyr` to the rescue: while not as fast as a compiled language, it is pretty dang good!
- And still readable

- This shows how the dataframe is broken up into pieces and each piece then gets whatever functions, summaries, or transformations we apply to it

- And this shows the output `ddply` has before it combines it back for us when we do the call `ddply(df, .(sex, age), "nrow")`

- `plyr` has a straightforward syntax
- All `plyr` functions are in the format **XX**ply. The two X's specify the type of the input we are applying a function to, and the type of output we would like back
- In `plyr`, d = dataframe, l = list, m = matrix, and a = array. By far the most common usage is `ddply`: from a dataframe, to a dataframe
- We will see more of `plyr` in Tutorial 4 as well
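A quick illustration of the naming convention on a made-up data frame; only the second letter (the output type) changes between the two calls:

```
library(plyr)
# Hypothetical mini data frame
dat <- data.frame(grp = c("a", "a", "b"), x = c(1, 2, 3))
# ddply: data frame in, data frame out
d_out <- ddply(dat, .(grp), summarize, mean_x = mean(x))
# dlply: data frame in, list out
l_out <- dlply(dat, .(grp), summarize, mean_x = mean(x))
class(d_out)
class(l_out)
```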

```
library(plyr)
myag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T),
mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T),
sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS))
head(myag)
```

```
## dist grade mean_read mean_math sd_read sd_math count_read
## 1 205 3 451.7 406.1 93.52 72.45 200
## 2 205 4 438.9 459.9 77.76 79.10 100
## 3 205 5 487.9 462.6 85.30 75.10 200
## 4 205 6 514.7 526.8 76.83 66.04 100
## 5 205 7 530.0 521.5 84.82 74.85 200
## 6 205 8 575.5 581.2 79.58 83.45 100
## count_math
## 1 200
## 2 100
## 3 200
## 4 100
## 5 200
## 6 100
```

- This looks complex, but it only has a few components.
- The first argument is the dataframe we are working on; the next argument is the level of identification we want to aggregate to
- `summarize` tells `ddply` what we are doing to the data frame
- Then we make a list of new variable names, and how to calculate them on each of the subsets in our large data frame
- That's it!

- This is great, we can quickly build a summary dataset from individual records
- A few advanced tricks. How do we build counts and percentages into our dataset?

```
myag <- ddply(df, .(dist, grade), summarize, mean_read = mean(readSS, na.rm = T),
mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS, na.rm = T),
sd_math = sd(mathSS, na.rm = T), count_read = length(readSS), count_math = length(mathSS),
count_black = length(race[race == "B"]), per_black = length(race[race ==
"B"])/length(readSS))
summary(myag[, 7:10])
```

```
## count_read count_math count_black per_black
## Min. :100 Min. :100 Min. :36.0 Min. :0.360
## 1st Qu.:100 1st Qu.:100 1st Qu.:41.2 1st Qu.:0.386
## Median :150 Median :150 Median :60.5 Median :0.412
## Mean :150 Mean :150 Mean :61.7 Mean :0.411
## 3rd Qu.:200 3rd Qu.:200 3rd Qu.:82.0 3rd Qu.:0.439
## Max. :200 Max. :200 Max. :92.0 Max. :0.480
```

- There is an alternate package to plyr called `data.table` which is really handy
- It allows SQL-like querying of R data frames
- It is incredibly fast
- It will be incorporated into the next `plyr` version
- You can read up on it online
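A minimal sketch of the `data.table` idiom, assuming the package is installed; the `DT[i, j, by]` form reads much like a SQL query:

```
library(data.table)
# Hypothetical mini-table with the same column names as our data
dt <- data.table(dist = c(205, 205, 402), readSS = c(450, 470, 500))
# Roughly: SELECT mean(readSS) FROM dt GROUP BY dist
dt[, mean(readSS), by = dist]
```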

- What if we want to compare how districts do on educating ELL students?
- What district ID has the highest mean score for 4th grade ELL students on reading? Math?
- 402 in reading, 495 in math
- How many students are in these classes?
- 29 and 13 respectively

```
myag2 <- ddply(df, .(dist, grade, ell), summarize, mean_read = mean(readSS,
na.rm = T), mean_math = mean(mathSS, na.rm = T), sd_read = sd(readSS,
na.rm = T), sd_math = sd(mathSS, na.rm = T), count_read = length(readSS),
count_math = length(mathSS), count_black = length(race[race == "B"]),
per_black = length(race[race == "B"])/length(readSS))
subset(myag2, ell == 1 & grade == 4)
```

```
## dist grade ell mean_read mean_math sd_read sd_math count_read
## 4 205 4 1 403.0 392.9 64.52 39.09 16
## 16 402 4 1 443.1 388.7 79.52 53.28 29
## 28 495 4 1 408.8 431.9 77.47 70.77 13
## count_math count_black per_black
## 4 16 2 0.12500
## 16 29 6 0.20690
## 28 13 1 0.07692
```

- A key way to explore data in tabular form is to sort data
- Sorting data in R can be dangerous as you can reorder the vectors of a dataframe
- We use the `order` function to sort data

```
df.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)
```

`## [1] 106 1026 2 56 122 118`

- Why is this wrong?
- What is R giving us?
- Rownames…

- To fix it, we need to tell R to reorder the data frame's rows in the order we want

```
df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])
```

```
## stuid attday readSS mathSS
## 1 149995 180 357.3 387.3
## 2 13495 180 263.9 302.6
## 3 106495 160 369.7 365.5
## 4 45205 168 346.6 344.5
## 5 142705 156 373.1 441.2
## 6 14995 157 436.8 463.4
```

`head(df.sort[, c(3, 23, 29, 30)])`

```
## stuid attday readSS mathSS
## 106 106705 160 251.5 277.0
## 1026 80995 176 263.2 377.8
## 2 13495 180 263.9 302.6
## 56 122402 180 264.3 271.7
## 122 79705 168 266.4 318.7
## 118 40495 173 266.9 275.0
```

`head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])`

```
## stuid attday readSS mathSS
## 1631 145205 137 833.2 828.4
## 1462 107705 180 773.3 746.6
## 2252 122902 180 744.0 621.6
## 2341 44902 175 741.7 676.3
## 1482 134705 180 739.2 705.4
## 1630 14495 162 738.9 758.2
```

- Here we find the high performing students; note that the `-` denotes we want descending order (R's default is ascending order)
- This is easy to correct

- Sorting works differently on some data types like matrices

```
M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL,
c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]
```

```
## a b
## [1,] 1 3
## [2,] 2 6
## [3,] 2 5
## [4,] 2 4
```

- Tables are familiar

```
mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]
```

```
##
## 2000 2001 2002
## 4 100 200 100
## 6 100 200 100
## 8 100 200 100
## 3 200 100 200
## 5 200 100 200
## 7 200 100 200
```

`mytab[order(mytab[, 2]), ]`

```
##
## 2000 2001 2002
## 3 200 100 200
## 5 200 100 200
## 7 200 100 200
## 4 100 200 100
## 6 100 200 100
## 8 100 200 100
```

- Filtering data is an incredibly powerful feature and we have already seen it used to do some interesting things
- Filtering data in R is loaded with trouble though, because the filtering arguments must be very carefully specified
- Filtering is like a mini-sort
- Always, always, always check your work
- And remember, this is the place the NAs do the most damage
- Let’s look at some examples

```
# Gives all rows that meet this requirement
df[df$readSS > 800, ]
```

```
## X school stuid grade schid dist white black hisp indian
## 1631 1281061 852 145205 8 15 205 1 0 0 0
## asian econ female ell disab sch_fay dist_fay luck ability
## 1631 0 0 1 0 0 0 0 0 108.3
## measerr teachq year attday schoolscore district schoolhigh
## 1631 6.325 155.7 2001 137 227.7 19 0
## schoolavg schoollow readSS mathSS proflvl race
## 1631 1 0 833.2 828.4 advanced W
```

`df$grade[df$mathSS > 800]  # gives all values of grade that meet this requirement`

`## [1] 8`

- This seems basic enough, let’s filter on multiple dimensions
- Before the brackets we specify what we want returned, and within the brackets we present the logical expression to evaluate
- Behind the scenes R does a logical test and gets the row numbers that match the logical expression
- It then combines them back with the object in front of the brackets to return the values
- So great we don’t have to do that!

`df$grade[df$black == 1 & df$readSS > 650]`

`## [1] 8 7 8 6 6 7 8 7 8 8 8 4`

- What happens if we type `df$black=1` or `black==1`?
- Why won't this work?
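The difference can be seen on a throwaway vector (a hypothetical stand-in for the `black` column):

```
black <- c(0, 1, 1, 0)  # hypothetical stand-in for df$black
black == 1              # comparison: a logical vector we can filter with
# black = 1 (single equals) would be assignment, silently overwriting the
# data; and a bare black == 1 only works if 'black' exists as an object
# in the workspace, which is why we need df$black inside the brackets
```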

- We can also use filters to assign values as well
- This is how you recode variables and create new ones
- Let's create a variable `spread` indicating whether a district has high or low spread among its student scores

```
myag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)
```

```
## high low
## 15 3
```

- The previous block of code is a useful way to learn how to recode variables

```
myag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
```

- Create a new variable in `myag` called `schoolperf` for `mean_math` scores with the following coding scheme:

| Grade | Score Range | Code |
|---|---|---|
| 3 | >425 | "Hi" |
| 4 | >450 | "Hi" |
| 5 | >475 | "Hi" |
| 6 | >500 | "Hi" |
| 7 | >525 | "Hi" |
| 8 | >575 | "Hi" |

- All other values are coded as “low”
- How many “high” and “low” observations do we have?
- By `dist`?

```
myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)
```

```
## hi lo
## 9 9
```

`table(myag$dist, myag$schoolperf)`

```
##
## hi lo
## 205 3 3
## 402 3 3
## 495 3 3
```

- For district 6, let's remove the grade 3 scores by replacing them with missing data

```
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA
head(myag[, 1:4], 2)
```

```
## dist grade mean_read mean_math
## 1 205 3 451.7 406.1
## 2 205 4 438.9 459.9
```

- Let’s replace one data element with another

```
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist ==
6 & myag$grade == 4]
head(myag[, 1:4], 2)
```

```
## dist grade mean_read mean_math
## 1 205 3 451.7 406.1
## 2 205 4 438.9 459.9
```

- Voila

- Let’s consider the case above but insert some NA values for all 3rd grade tests

```
myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])
```

```
## dist grade mean_read mean_math
## 1 205 3 NA 406.1
## 7 402 3 NA 431.9
## 13 495 3 NA 405.5
## 2 205 4 438.9 459.9
## 8 402 4 474.9 432.8
## 14 495 4 447.8 469.1
```

- Now let’s calculate a few statistics:

`mean(myag$mean_math)`

`## [1] 490.7`

`mean(myag$mean_read)`

`## [1] NA`

- Remember, NA values propagate: R assumes an NA value could take literally any value, and as such it is impossible to know the `mean` of a vector containing NA
- We can override this though:

`mean(myag$mean_math, na.rm = T)`

`## [1] 490.7`

`mean(myag$mean_read, na.rm = T)`

`## [1] 507.5`

- But for other problems it is tricky
- What if we want to know the number of rows that have a `mean_read` of less than 500?

`length(myag$dist[myag$mean_read < 500])`

`## [1] 10`

`head(myag$mean_read[myag$mean_read < 500])`

`## [1] NA 438.9 487.9 NA 474.9 472.5`

- And what if we want to add the standard deviation to these vectors?

```
badvar <- myag$mean_read + myag$sd_read
summary(badvar)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 517 566 592 587 614 655 3
```

- Consider the case where two sets of variables have different missing elements

```
myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])
```

`## [1] 16`

`length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])`

`## [1] 13`

- What is `!is.na()`?
- `is.na()` is a helpful function that returns TRUE if a value is missing
- `!` is the negation ("not") operator
- We are asking R whether each value is not missing, so that it only gives us non-missing values back
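A minimal sketch on a throwaway vector:

```
x <- c(438.9, NA, 487.9, NA, 474.9)  # hypothetical values with gaps
is.na(x)                 # TRUE where values are missing
x[x < 480]               # NA comparisons leak NAs into the result
x[x < 480 & !is.na(x)]   # only the real values below 480
```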

- It is unlikely all the data we will want resides in a single dataset and often we have to combine data from several sources
- R makes this easy, but that simplicity comes at a cost–it can be easy to make mistakes if you don’t specify things carefully
- Let’s merge attributes about a student’s school with the student row data
- We might want to do that if we want to evaluate the performance of students in different school climates, and school climate was measured in part by the mean performance

- We have two data objects: `df`, which has multiple rows per student, and `myag`, which has multiple rows per school
- What are the variables that **link** these two together?

`names(myag)`

```
## [1] "dist" "grade" "mean_read" "mean_math"
## [5] "sd_read" "sd_math" "count_read" "count_math"
## [9] "count_black" "per_black" "spread" "schoolperf"
```

`names(df[, c(2, 3, 4, 6)])`

`## [1] "school" "stuid" "grade" "dist"`

- It looks like `dist` and `grade` are in common. Is this ok?
- Why might we want to consider re-aggregating with `year` as well?
- For this example we won't just yet

- We have a few options with `merge` that we want to consider; see `?merge`
- In the simple case we let `merge` **automagically** find the linking variables:

```
simple_merge <- merge(df, myag)
names(simple_merge)
```

```
## [1] "grade" "dist" "X" "school"
## [5] "stuid" "schid" "white" "black"
## [9] "hisp" "indian" "asian" "econ"
## [13] "female" "ell" "disab" "sch_fay"
## [17] "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore"
## [25] "district" "schoolhigh" "schoolavg" "schoollow"
## [29] "readSS" "mathSS" "proflvl" "race"
## [33] "mean_read" "mean_math" "sd_read" "sd_math"
## [37] "count_read" "count_math" "count_black" "per_black"
## [41] "spread" "schoolperf"
```

- It looks like it did a good job

- In complicated cases, merge has some important options we should review
- First is the simple sounding `by` argument: `simple_merge <- merge(df1, df2, by = c("id1", "id2"))`
- We can also specify `simple_merge <- merge(df1, df2, by.x = c("id1", "id2"), by.y = c("id1_a", "id2_a"))`
- This allows us to have different names for our ID variables
- Now, what if we have two different sized objects and not all matches between them? `notsosimple_merge <- merge(df1, df2, all.x = TRUE, all.y = TRUE)`
- We can tell R whether we want to keep all of the `x` observations (df1), all the `y` observations (df2), neither, or both
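These options can be tried out on two tiny hypothetical data frames with an imperfect overlap:

```
# Hypothetical data frames sharing an 'id' key, with ids 1 and 4 unmatched
df1 <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
df2 <- data.frame(id = c(2, 3, 4), grade = c(4, 5, 6))
merge(df1, df2)                # inner join: only ids 2 and 3 survive
merge(df1, df2, all.x = TRUE)  # keep all of df1; grade is NA for id 1
merge(df1, df2, all = TRUE)    # keep everything; NAs where unmatched
```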

- Reshaping data is a slightly different issue than aggregating data
- Let’s review the two data types: long and wide

`head(df[, 1:10], 3)`

```
## X school stuid grade schid dist white black hisp indian
## 1 44 1 149995 3 105 495 0 1 0 0
## 2 53 1 13495 3 45 495 0 1 0 0
## 3 116 1 106495 3 45 495 0 1 0 0
```

- Now let’s look at wide:

`head(widedf[, 28:40], 3)`

```
## readSS.2000 mathSS.2000 proflvl.2000 race.2000 X.2001 school.2001
## 1 357.3 387.3 basic B 441000 1
## 2 263.9 302.6 below basic B 531000 1
## 3 369.7 365.5 basic B 1161000 1
## grade.2001 schid.2001 dist.2001 white.2001 black.2001 hisp.2001
## 1 4 105 495 0 1 0
## 2 4 45 495 0 1 0
## 3 4 45 495 0 1 0
## indian.2001
## 1 0
## 2 0
## 3 0
```

- How did we do this?

- The great debate
- Most econometrics, panel, and time series datasets come wide and so these seem familiar
- R for most cases prefers long data, including for most graphing and analysis functions
- But, not all
- So we have to learn both

- `reshape` is the way to move from wide to long
- The data stays the same, but the shape of it changes
- The long data had dimensions: `2700, 32`
- The wide data has dimensions: `1200, 91`

- How do we get to these numbers?
- The rows in the wide dataframe represent unique students

`widedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")`

- `idvar` represents the unit we want to represent a single row; in this case each unique student gets a single row
- In this simple case `timevar` is the variable that differentiates between two rows with the same student ID
- `direction` tells R we are going to move to wide data
- As written all data will move, but using the `varying` argument we can tell R explicitly which items we want to move wide

- We often need to do this to plot data in R
- Luckily the `reshape` function works well in both directions

```
longdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[,
2:91]), direction = "long", sep = ".")
```

- If our data is formatted nicely, R can do the guessing and identify the years for us by parsing the dataframe names

- We have already seen a lot of subsetting examples above, which is what filtering is, but R provides some great shortcuts to this
- Let's look at the `subset` function to get only 4th grade scores

```
g4 <- subset(df, grade == 4)
dim(g4)
```

`## [1] 400 32`

- This is equivalent to:

`g4_b <- df[df$grade == 4, ]`

- These two elements are the same:

`identical(g4, g4_b)`

`## [1] TRUE`

- Now you can filter, subset, sort, recode, and aggregate data!
- Let’s look at a few exercises to test these skills
- Once these skills are mastered, we can begin to understand how to automate R to clean data with known errors, and to recode data in R so it is ready to be used for analysis
- Then we can really take off!

Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let's fix it by **first** aggregating our student data frame to a school/grade/year data frame, and **second** by merging that new data frame with our student level data.

Sort the student-level data frame on `attday` and `ability` in descending order.

Find the highest proportion of black students in any school/grade/year combination.

It is good to include the session info, e.g. this document is produced with **knitr** version `0.8`. Here is my session info:

`print(sessionInfo(), locale = FALSE)`

```
## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods
## [7] base
##
## other attached packages:
## [1] plyr_1.7.1 ggplot2_0.9.2.1 lmtest_0.9-30 zoo_1.7-9
## [5] knitr_0.8
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-0 dichromat_1.2-4 digest_0.5.2
## [4] evaluate_0.4.2 formatR_0.6 grid_2.15.2
## [7] gtable_0.1.1 labeling_0.1 lattice_0.20-10
## [10] MASS_7.3-22 memoise_0.1 munsell_0.4
## [13] proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1
## [16] scales_0.2.2 stringr_0.6.1 tools_2.15.1
```

This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.