DPI R Bootcamp

Jared Knowles

- What is R?
- What is RStudio?
- How does it work?
- What makes the language different?
- Why learn it?

- R is an Open Source (and freely available) environment for statistical computing and graphics
- Available for Windows, Mac OS X, and Linux
- R is being actively developed with two major releases per year and dozens of releases of add on packages
- R can be extended with ‘packages’ that contain data, code, and documentation to add new functionality

The R workspace in RStudio

- R is a flavor of the **S** computer language
- S was developed by John Chambers at Bell Labs in the late 1970s
- In 1988 it was rewritten from a Fortran base to a C base
- Version 4 of S, the latest version, was finished in 1998, and won several awards

John Chambers, in describing the logic behind the S language said:

[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

- 1991 in New Zealand Ross Ihaka and Robert Gentleman create R
- Named for their first initials
- R is made public in 1993, and in 1995 Martin Maechler convinces the creators to make it open source with the GNU General Public License
- 1997 R Core Group is formed–the maintainers and main developers of R (about 14 members today)
- 2000 version 1.0.0 ships
- 2012 version 2.15.2 is available (2.16.0 is due out early in 2013)

- R is a common tool among data experts at major universities
- No need to go through procurement, R can be installed in any environment on any machine and used with no licensing or agreements needed
- R source code is very readable to increase transparency of processes
- R code is easily borrowed from and shared with others
- R is incredibly flexible and can be adapted to specific local needs
- R is under incredibly active development, improving greatly, and supported widely by both professional and academic developers

- R is free in many senses

- R can be run and used for any purpose, commercial or non-commercial, profit or not-for-profit
- R’s source code is freely available so you can study how it works and adapt it to your needs.
- R is free to redistribute so you can share it with your ~~enemies~~ friends
- R is free to modify and those modifications are free to redistribute and may be adopted by the rest of the community!

- R is platform agnostic–Linux, Mac, PC, server, desktop, etc.
- R can output results in a variety of formats
- R can build routines straight out of a database for common and universal reporting

- R plays nicely with data from Stata, SPSS, SAS and others
- R can check work, produce output, visualize results from other programs
- R can do bleeding edge analyses that aren’t available in proprietary packages yet
- R is becoming more prevalent in undergraduate statistics courses–more and more potential employees are learning it each year

- R is based on S, which is close to 40 years old
- R only has features that the community contributes
- Not the ideal solution to all problems
- R is a programming language and not a software package–steeper learning curve
- R can be much slower than compiled languages

R has recently passed Stata on Google Scholar hits and it is catching up to the two major players SPSS and SAS

R is linked to from more and more sites

These links come from the explosion of add-on packages to R

Usage of the R listserv for help has really exploded recently

Data from Bob Muenchen available online

- **packages** are add on features to R that include data, new functions and methods, and extended capabilities. Think of them as “apps” on your phone. We’ve already installed several!
- **terminal** this is the main window of R where you enter commands
- **scripts** these are where you store commands to be run in the terminal later, like syntax files in SPSS or .do files in Stata
- **functions** commands that do something to an object in R
- **dataframe** the main element for statistical purposes, an object with rows and columns that includes numbers, factors, and other data types
- **workspace** the working memory of R where all objects are stored
- **vector** the basic unit of data in R
- **symbols** used to name and store objects or to designate operations/functions
- **attributes** determine how functions act on objects

- **R** - R works in the command line of any OS, but also comes with a basic GUI to operate on its own in Windows and Mac
- **RStudio** - a much better way to work in R that allows editing of scripts, operation of R, viewing of the workspace, and R help all on one screen
- We’ve already installed and configured these, but what might you want to use to go further after the Bootcamp is complete?

- **LaTeX** - for producing documents using R this is less necessary, but still useful
- **Dev Tools for R** - on Windows this is Rtools, on Linux and Mac it is installing the development mode of R
- **Git** - for version control, sharing code, and collaboration this is essential. It integrates well with RStudio.
- **pandoc** - for converting output into other formats for sharing with non-use**R**s!
- **ImageMagick** - for creating more flexible graphics in R, including animations!

- This really represents a completely open source toolchain for going from a data analysis idea to a full-fledged professional report
- These tools are free, updated regularly, and available on **any** platform **today**
- This toolchain is scriptable, reproducible, and has been around for quite some time so it is stable and robust
- Lots of other programming languages use aspects of this toolchain

- Adding packages onto R means you also have to update them with the `update.packages()` command
- Package updates are at the whim of the package developer, which may be a corporation, an R Core Member, an academic, or me
- Upgrading R on Windows, which is on a 6 month release cycle, is not straightforward yet, but there are lots of how-to guides online
- Generally, upgrading is recommended as there are usually a lot of performance enhancements, even with minor (2.15.X) releases
- We will walk through this a bit later, but remember that the flexibility in R means that users probably need to be self-supported in your organization
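
Before updating anything, it helps to see what is already installed. The lines below are a minimal sketch using only base R; `update.packages()` itself needs a network connection and may prompt you, so it is shown commented out.

```r
# Which version of R is running?
R.version.string

# Names of the installed packages (first few only)
pkgs <- installed.packages()
head(rownames(pkgs))

# Version of one specific package; "stats" ships with every R installation
packageVersion("stats")

# Uncomment to update installed packages from your configured repository
# update.packages(ask = FALSE)
```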

- In the spirit of open-source R is very much a self-guided tool
- Let’s see, type: `?summary`
- Now type: `??regression`

- For tricky questions, funky error messages (there are many), and other issues, use Google (add “in R” to the end of your query)
- We can also use RSeek - the search engine just for R!
- StackOverflow has become a great resource with many questions for many specific packages in R, and a rating system for answers
- A number of R Core members contribute there

- Sometimes R Help can be a bit prickly and unfriendly…
- The most important part of getting help is being able to ask your question with a reproducible example (i.e. some short simulated data and code that doesn’t do what you want)
- For R Help etiquette (for the tough problems) see the great advice here

```
foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)
foo + bar
```

`Error: non-numeric argument to binary operator`
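
A short sketch of how you might track down the error above: check the class of each vector first. The name `foo_num` is only an illustration and is not part of the original example.

```r
# Same vectors as above: the single "b" makes the whole vector character
foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)

class(foo)   # "character", which is why foo + bar fails
class(bar)   # "numeric"

# Converting to numeric turns anything that is not a number into NA
# (R warns about this; the warning is silenced here on purpose)
foo_num <- suppressWarnings(as.numeric(foo))
foo_num
foo_num + bar
```

The sum now works, but the second element is `NA`, which is R telling you that the original value was never a number.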

- Go ahead and open RStudio now and let’s see what we get
- 4 panels, various tabs
- Help, plots, file structure
- Workspace, history, version control
- Working files
- The R Console

- What happens when you type:

```
data(mtcars)
mtcars
```

```
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
```

- A data frame is a lot like a traditional spreadsheet and is the central feature of data analysis in R
- You’ll see we have names, stored with numbers, which other programming languages are not so keen on doing
- This is really handy, and data frames are the primary method we use to store data, manipulate data, etc.
- The difference between a data frame and spreadsheet in Excel is that in almost all cases, everything we do to the data frame will either be by entire row, or by entire column
- This sounds limiting, but in fact, it is incredibly powerful!
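
To make the "entire row or entire column" idea concrete, here is a small sketch using the built-in `mtcars` data. The object name `four_cyl` is just an illustration.

```r
data(mtcars)

# Work on an entire column at once
mean(mtcars$mpg)

# Work on several columns at once
colMeans(mtcars[, c("mpg", "hp", "wt")])

# Keep the entire rows that meet a condition
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)
```

No loops over cells are needed: each line above acts on a whole column or a whole set of rows.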

`2 + 2 # add numbers`

`[1] 4`

`2 * pi #multiply by a constant`

`[1] 6.283`

`7 + runif(1, min = 0, max = 1) #add a random variable`

`[1] 7.679`

`4^4 # powers`

`[1] 256`

`sqrt(4^4) # functions`

`[1] 16`

- In addition to the obvious `+`, `-`, `/`, `*`, the equality test `==`, and exponentiation `^`, there is also integer division `%/%` and remainder in integer division (known as modulo arithmetic) `%%`

`2 + 2`

`[1] 4`

`2/2`

`[1] 1`

`2 * 2`

`[1] 4`

`2^2`

`[1] 4`

`2 == 2`

`[1] TRUE`

`23%/%2`

`[1] 11`

`23%%2`

`[1] 1`

- `<-` is the assignment operator; it declares something is something else

```
foo <- 3
foo
```

`[1] 3`

- `:` is the sequence operator

`1:10`

` [1] 1 2 3 4 5 6 7 8 9 10`

```
# it increments by one
a <- 100:120
a
```

```
[1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
[18] 117 118 119 120
```

**This is handy**

- **#** denotes a comment in R
- Anything after the **#** is not evaluated and is ignored by R
- This is handy for making things reproducible

```
# Something I want to keep from R
# Like my secret from the R engine
# Maybe intended for a human and not the computer
# Like: Look at this cool plot!
myplot(readSS,mathSS,data=df)
```

- R also supports advanced mathematical features and expressions
- R can take integrals and derivatives and express complex functions
- Easiest of all, R can generate distributions of data very easily, e.g. `rnorm(100)` or `rbinom(100, size = 1, prob = 0.5)`

This comes in handy when writing examples and building analyses because it is trivial to generate a synthetic piece of data to use as an example
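
A brief sketch of these features using base R only. The seed value and the object names (`x`, `flips`, `area`) are arbitrary choices for illustration.

```r
set.seed(42)  # any seed works; it just makes the random draws repeatable

x <- rnorm(100, mean = 0, sd = 1)           # 100 draws from a standard normal
flips <- rbinom(100, size = 1, prob = 0.5)  # 100 coin flips coded 0 or 1

length(x)
table(flips)

# Numerical integration: area under the standard normal density
area <- integrate(dnorm, lower = -Inf, upper = Inf)
area$value   # very close to 1

# Symbolic derivative of x^2 with respect to x
D(expression(x^2), "x")
```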

- To do more we need to learn how to manipulate the ‘workspace’.
- This includes all the vectors, datasets, and functions stored in memory.
- All R objects are stored in the memory of the computer, limiting the available space for calculation to the size of the RAM on your machine.
- R makes organizing the workspace easy.

```
x <- 5 #store a variable with <-
x #print the variable
```

`[1] 5`

```
z <- 3
ls() #list all variables
```

`[1] "x" "z"`

`ls.str() #list and describe variables`

```
x : num 5
z : num 3
```

```
rm(x) # delete a variable
ls()
```

`[1] "z"`

- R is more than statistical software, it is a computer language
- Like any language it has rules (some poorly enforced), and conventions
- You will learn more as you go, but we’ll go over a few to start

- Case sensitivity matters

```
a <- 3
A <- 4
print(c(a, A))
```

`[1] 3 4`

**a**≠**A**

- What happens if I type **print(a,A)**?

- `c` is our friend
- So what does **c** do?

```
A <- c(3, 4)
print(A)
```

`[1] 3 4`

- `c` stands for concatenate and allows vectors to have multiple elements
- If you ever need two elements in a vector, you need to wrap it up in `c`, which is one of the most used functions you will ever use
- `c` is important to put any vector together, but remember that objects within a vector must all be of the same type

- In language there are a number of ways to say the same thing
- The dog chased the cat.
- The cat was chased by the dog.
- By the dog, the cat was chased.
- Some ways are more elegant than others, all convey the same message.

```
a <- runif(100) # Generate 100 random numbers
b <- runif(100) # 100 more
c <- NULL # Setup for loop (declare variables)
for (i in 1:100) {
# Loop just like in Java or C
c[i] <- a[i] * b[i]
}
d <- a * b
identical(c, d) # Test equality
```

`[1] TRUE`

- Which is nicer? **c** or **d**?

R is maddeningly inconsistent in its naming conventions

- Some functions are `camelCase`; others `are.dot.separated`; others `use_underscores`
- Function results are stored in a variety of ways across function implementations
- R has multiple graphics packages that different functions use for default plot construction (`base`, `grid`, `lattice`, and `ggplot2`)
- R has multiple packages and functions to do the same analysis as well, though some standardization has started to occur

Be flexible and be aware of R’s flexibility

- Everything in R is an object–even functions
- Objects can be manipulated many ways
- A common example is applying the `summary` function to a variety of object types and seeing how it adapts

`summary(df[, 28:31]) #summary look at df object`

```
schoollow readSS mathSS proflvl
Min. :0.000 Min. :252 Min. :210 advanced : 788
1st Qu.:0.000 1st Qu.:430 1st Qu.:418 basic : 523
Median :0.000 Median :495 Median :480 below basic: 210
Mean :0.242 Mean :496 Mean :483 proficient :1179
3rd Qu.:0.000 3rd Qu.:562 3rd Qu.:543
Max. :1.000 Max. :833 Max. :828
```

`summary(df$readSS) #summary of a single column`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
252 430 495 496 562 833
```

- The `$` says to look for object **readSS** in object **df**
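
The bootcamp data `df` is not defined in these notes, so here is a sketch of the same idea with the built-in `mtcars` data. It shows that `$` is one of several equivalent ways to pull out a column; the object names are illustrative only.

```r
data(mtcars)

by_dollar  <- mtcars$mpg        # column by name with $
by_bracket <- mtcars[["mpg"]]   # the same column with [[ ]]
by_index   <- mtcars[, "mpg"]   # the same column with [row, column] indexing

identical(by_dollar, by_bracket)
identical(by_dollar, by_index)
summary(by_dollar)
```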

```
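library(ggplot2) # Load graphics package
# theme_dpi() is a custom DPI theme; add it back only if the package providing it is loaded
# labs() sets the title; opts() is no longer available in current ggplot2
ggplot(df, aes(x = readSS, y = mathSS)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(title = 'Test Score Relationship')
```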

- R handles data differently than many other statistical packages
- In R, all elements are objects

`length(unique(df$school))`

`[1] 173`

`length(unique(df$stuid))`

`[1] 1200`

```
uniqstu <- length(unique(df$stuid))
uniqstu
```

`[1] 1200`

- Results of function calls can be stored

- The comparison operators `<`, `>`, `<=`, `>=`, `==`, and `!=` are used to compare values across vectors

```
big <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small
```

`[1] FALSE TRUE TRUE TRUE`

```
big = small
# Oops--don't do this, reassigns big to small
print(big)
```

`[1] 9 3 4 2`

`print(small)`

`[1] 9 3 4 2`

- Comparison operators can be tricky, so to keep it straight never use `=` or `==` to assign anything, always use `<-`
- The best way to evaluate these objects is to use brackets `[]` to avoid confusion

```
big <- c(9, 12, 15, 25)
big[big == small]
```

`[1] 9`

```
# Returns values where the logical vector is true
big[big > small]
```

`[1] 12 15 25`

`big[big < small] # Returns an empty set`

`numeric(0)`

- The `%in%` operator determines whether each value in the left operand can be matched with one of the values in the right operand.

```
big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)
big[small %in% big]
```

`[1] 9 12 15 25 NA`

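- `small %in% big` returns one logical value for each element of `small`, so the index has seven elements while `big` has only four
- The first four `TRUE` values return 9, 12, 15, and 25; the fifth `TRUE` (the repeated 9) points past the end of `big`, and so an `NA` is returned
- What if we reverse this?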

`big[big %in% small]`

`[1] 9 12 15 25`

- No `NA`, because `big %in% small` has exactly one logical value for each element of `big`
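
A rule of thumb that avoids the stray `NA`: index a vector with a logical vector built from that same vector. A short sketch with the same `big` and `small`:

```r
big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)

small %in% big            # one TRUE/FALSE for each element of small
small[small %in% big]     # values of small that appear in big
small[!(small %in% big)]  # values of small that do not appear in big
match(small, big)         # position in big of each element of small, or NA
```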

- The logical operators `|` (or) and `&` (and) can be used to combine two logical values and produce another logical value as the result. The operator `!` (not) negates a logical value. These operators allow complex conditions to be constructed.

```
foo <- c("a", NA, 4, 9, 8.7)
!is.na(foo) # Returns TRUE for non-NA
```

`[1] TRUE FALSE TRUE TRUE TRUE`

`class(foo)`

`[1] "character"`

```
a <- foo[!is.na(foo)]
a
```

`[1] "a" "4" "9" "8.7"`

`class(a)`

`[1] "character"`

- The operators `|` and `&` also combine two logical vectors. The comparison is performed element by element, so the result is also a logical vector. The doubled forms `||` and `&&` are different: they compare single logical values and are meant for `if` conditions.

```
zap <- c(1, 4, 8, 2, 9, 11)
zap[zap > 2 | zap < 8]
```

`[1] 1 4 8 2 9 11`

`zap[zap > 2 & zap < 8]`

`[1] 4`
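
A small sketch of when to reach for each form, reusing `zap` from above. The names `in_range` and `first` are illustrative.

```r
zap <- c(1, 4, 8, 2, 9, 11)

# Element-wise: one result per element, useful for subsetting
in_range <- zap > 2 & zap < 8
in_range

# Single values: && and || belong inside if() conditions
first <- zap[1]
if (first > 0 && first < 5) {
  message("the first element is between 0 and 5")
}

# any() and all() collapse a logical vector to a single value
any(in_range)
all(in_range)
```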

- R also supports a full suite of regular expressions
- This could be material for a full tutorial in a more advanced bootcamp
- If you know and use regex, then rest assured you can keep using it in R
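
Just to give a taste, here is a minimal sketch with three base R regex functions. The school names are made up for illustration.

```r
# Hypothetical school names
schools <- c("Lincoln Elementary", "Washington High", "Jefferson Elementary")

# Which names end in "Elementary"?
is_elem <- grepl("Elementary$", schools)
is_elem

# Strip the school type from the end of each name
short <- sub(" (Elementary|High)$", "", schools)
short

# Extract the first word of each name
first_word <- regmatches(schools, regexpr("^[A-Z][a-z]+", schools))
first_word
```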

- R allows users to implement a number of different types of data
- The three basic data types are numeric data, character data, and logical data
- Vectors must be of one consistent type of data, so if you make a vector with multiple types, it generally defaults to being a character vector
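
A quick sketch of that last point. The vector names are illustrative only.

```r
# A number, a string, and a logical in one vector: everything becomes character
mixed <- c(1, "two", TRUE)
mixed
class(mixed)

# Numbers and logicals only: the logicals become 1 and 0
nums_and_logicals <- c(1, TRUE, FALSE)
nums_and_logicals
class(nums_and_logicals)
```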

- **numeric** vectors contain, as you would guess, numbers!

`is.numeric(A)`

`[1] TRUE`

`class(A)`

`[1] "numeric"`

`print(A)`

`[1] 3 4`

- **character** is known as strings in other software, any characters that have no numeric meaning

```
b <- c("one", "two", "three")
print(b)
```

`[1] "one" "two" "three"`

`is.numeric(b)`

`[1] FALSE`

- **logical** is TRUE or FALSE values, useful for logical testing and programming
- We’ve already seen these returned when we have asked R a question before

```
c <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
is.numeric(c)
```

`[1] FALSE`

`is.character(c)`

`[1] FALSE`

`is.logical(c) # Results in a logical value`

`[1] TRUE`

- Just ask R using the `class` function

`class(A)`

`[1] "numeric"`

`class(b)`

`[1] "character"`

`class(c)`

`[1] "logical"`

- Vectors are collections of consistent data types
- **numeric** can be stored either as double (8 bytes per value) or integer (4 bytes per value)
- **logical**
- **character**
- **complex**
- **raw**
- All vectors must be consistent among types, but some data objects like data frames can combine multiple vectors of different types

- A factor is a very special and sometimes frustrating data type in R

```
myfac <- factor(c("basic", "proficient", "advanced", "minimal"))
class(myfac)
```

`[1] "factor"`

`myfac # What order are the factors in?`

```
[1] basic proficient advanced minimal
Levels: advanced basic minimal proficient
```

- What if we don’t like the order these are in? Factor order is important for all kinds of things like plot type, regression output, and more

- Ordered factors simply have an additional attribute explaining the order of the levels of a factor
- This is a useful shortcut when we want to preserve some of the meaning provided by the order
- Think ordinal data

```
myfac_o <- ordered(myfac, levels = c("minimal", "basic", "proficient", "advanced"))
myfac_o
```

```
[1] basic proficient advanced minimal
Levels: minimal < basic < proficient < advanced
```

`summary(myfac_o)`

```
minimal basic proficient advanced
1 1 1 1
```

- Turning factors into other data types can be tricky. All factor levels have an underlying numeric structure.

`class(myfac_o)`

`[1] "ordered" "factor" `

`unclass(myfac_o)`

```
[1] 2 3 4 1
attr(,"levels")
[1] "minimal" "basic" "proficient" "advanced"
```

```
defac <- unclass(myfac_o)
defac
```

```
[1] 2 3 4 1
attr(,"levels")
[1] "minimal" "basic" "proficient" "advanced"
```

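- What is wrong with this? The integers are only level codes: `minimal` is `1` and `basic` is `2` because of where they sit in the level order, not because of any value in the data.
- Be careful! The best way to unpack a factor is to convert it to a character first.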

```
defac <- function(x) {
x <- as.character(x)
x
}
defac(myfac_o)
```

`[1] "basic" "proficient" "advanced" "minimal" `

```
defac <- defac(myfac_o)
defac
```

`[1] "basic" "proficient" "advanced" "minimal" `

- What if we do want it to be numeric?
- The best way to do this is to recode the variable manually–we’ll discuss this later
- You can try to convert it to numeric though, but do so at your own risk:

`myfac_o`

```
[1] basic proficient advanced minimal
Levels: minimal < basic < proficient < advanced
```

`as.numeric(myfac_o)`

`[1] 2 3 4 1`

- If we did not properly specify the order above, this would be wrong!

`myfac`

```
[1] basic proficient advanced minimal
Levels: advanced basic minimal proficient
```

`as.numeric(myfac)`

`[1] 2 4 1 3`
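
The same trap bites hardest when the factor labels are themselves numbers. A short sketch with made-up scores, showing why converting to character first is the safer route:

```r
# Hypothetical scores that were read in as a factor
scores <- factor(c("10", "50", "20", "50"))

codes <- as.numeric(scores)                 # level codes, not the scores
codes

values <- as.numeric(as.character(scores))  # the actual scores
values

# An equivalent idiom: convert the levels once, then index by the codes
values2 <- as.numeric(levels(scores))[scores]
values2
```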

- R has built-in ways to handle dates
- See the `lubridate` package for more advanced functionality

```
mydate <- as.Date("7/20/2012", format = "%m/%d/%Y")
# Input is a character string and a parser
class(mydate) # this is date
```

`[1] "Date"`

`weekdays(mydate) # what day of the week is it?`

`[1] "Friday"`

`mydate + 30 # Operate on dates`

`[1] "2012-08-19"`

```
# We can parse other formats of dates
mydate2 <- as.Date("8-5-1988", format = "%d-%m-%Y")
mydate2
```

`[1] "1988-05-08"`

```
mydate - mydate2
```

`Time difference of 8839 days`

`# Can add and subtract two date objects`

- R converts all dates to numeric values, like Excel and other languages
- The origin date in R is January 1, 1970

`as.numeric(mydate) # days since 1-1-1970`

`[1] 15541`

`as.Date(56, origin = "2013-4-29") # we can set our own origin`

`[1] "2013-06-24"`
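
A few more date operations in base R, as a sketch. The `holiday` date is an arbitrary example.

```r
mydate <- as.Date("2012-07-20")   # ISO dates need no format string

us_style <- format(mydate, "%m/%d/%Y")   # back to a US-style character string
us_style

# The same day of the month in three consecutive months
monthly <- seq(mydate, by = "month", length.out = 3)
monthly

# Number of days between two dates
holiday <- as.Date("2012-12-25")
days_until <- as.numeric(holiday - mydate)
days_until
```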

- R classes can be specified for any special purpose
- Like linear models

```
b <- rnorm(5000)
c <- runif(5000)
a <- b + c
mymod <- lm(a ~ b)
class(mymod)
```

`[1] "lm"`

- Classes determine what you can and can’t do with objects
- Classes have different computational times associated with them, for optimization
- Classes allow you to keep projects/data organized and following business rules
**Because R makes you care**
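
To see how a class changes what a function does, here is a minimal sketch of an S3 class with its own `print` method. The class name `testscore` and its fields are invented for illustration.

```r
# A hypothetical class called "testscore"
score <- structure(list(value = 495, subject = "reading"), class = "testscore")

# A print method tells R how to display objects of this class
print.testscore <- function(x, ...) {
  cat("Scale score of", x$value, "in", x$subject, "\n")
  invisible(x)
}

print(score)                   # R picks print.testscore because of the class
class(score)
inherits(score, "testscore")
```

This is the same mechanism that lets `summary` behave differently for a data frame, a single column, or a model of class `lm`.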

- R has a number of basic data classes as well as arbitrary specialized object types for various purposes
- **vectors** are the basic data class in R and can be thought of as a single column of data (even a column of length 1)
- **matrices and arrays** are rows and columns of all the same mode data
- **dataframes** are rows and columns where the columns can represent different data types
- **lists** are arbitrary combinations of disparate object types in R

- Everything is a vector in R, even single numbers
- Single objects are “atomic” vectors

`print(1)`

`[1] 1`

```
# The [1] is the index of the first element shown; this is a vector of length 1
print("This tutorial is awesome")
```

`[1] "This tutorial is awesome"`

```
# This is a vector of length 1 consisting of a single 'string of
# characters'
```

`print(LETTERS)`

```
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
```

```
# This vector has 26 character elements
print(LETTERS[6])
```

`[1] "F"`

```
# The sixth element of this vector has length 1
length(LETTERS[6])
```

`[1] 1`

`# The length of that element is a number with length 1`

- Matrices are combinations of vectors of the same length and data type
- We can have numeric matrices, character matrices, or logical matrices
- Can’t mix types

```
mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
class(mymat)
```

`[1] "matrix"`

`rownames(mymat)`

`[1] "A" "B" "C" "D" "E" "F"`

`colnames(mymat)`

`[1] "G" "H" "I" "J" "K" "L"`

`mymat`

```
G H I J K L
A 1 7 13 19 25 31
B 2 8 14 20 26 32
C 3 9 15 21 27 33
D 4 10 16 22 28 34
E 5 11 17 23 29 35
F 6 12 18 24 30 36
```

- We can add to matrices

`dim(mymat) # We have 6 rows and 6 columns`

`[1] 6 6`

```
myvec <- c(5, 3, 5, 6, 1, 2)
length(myvec) # What happens when you do dim(myvec)?
```

`[1] 6`

```
newmat <- cbind(mymat, myvec)
newmat
```

```
G H I J K L myvec
A 1 7 13 19 25 31 5
B 2 8 14 20 26 32 3
C 3 9 15 21 27 33 5
D 4 10 16 22 28 34 6
E 5 11 17 23 29 35 1
F 6 12 18 24 30 36 2
```

- Dataframes work similarly

- We can do some basic math to matrices as well, like correlations

```
foo.mat <- matrix(c(rnorm(100), runif(100), runif(100), rpois(100, 2)), ncol = 4)
head(foo.mat)
```

```
[,1] [,2] [,3] [,4]
[1,] -0.5002 0.9585 0.6236 0
[2,] -2.4363 0.7872 0.8873 2
[3,] -1.2383 0.5736 0.9900 1
[4,] -0.9142 0.3383 0.6566 2
[5,] -0.7001 0.1646 0.7553 4
[6,] 0.1098 0.4470 0.9363 1
```

`cor(foo.mat)`

```
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.0009384 -0.0840 -0.09061
[2,] 0.0009384 1.0000000 -0.1209 -0.17833
[3,] -0.0839969 -0.1209120 1.0000 -0.09450
[4,] -0.0906143 -0.1783325 -0.0945 1.00000
```

- The result is a matrix itself, but we can force it to be something else

- Let’s make a matrix be a dataframe

```
mycorr <- cor(foo.mat)
class(mycorr)
```

`[1] "matrix"`

```
mycorr2 <- as.data.frame(mycorr)
class(mycorr2)
```

`[1] "data.frame"`

`mycorr2`

```
V1 V2 V3 V4
1 1.0000000 0.0009384 -0.0840 -0.09061
2 0.0009384 1.0000000 -0.1209 -0.17833
3 -0.0839969 -0.1209120 1.0000 -0.09450
4 -0.0906143 -0.1783325 -0.0945 1.00000
```

- Arrays are a set of matrices of the same `dim` and `class`
- Arrays allow dimensions to be named

```
myarray <- array(1:42, dim = c(7, 3, 2), dimnames = list(c("tiny", "small",
"medium", "medium-ish", "large", "big", "huge"), c("slow", "moderate", "fast"),
c("boring", "fun")))
class(myarray)
```

`[1] "array"`

`dim(myarray)`

`[1] 7 3 2`

`dimnames(myarray)`

```
[[1]]
[1] "tiny" "small" "medium" "medium-ish" "large"
[6] "big" "huge"
[[2]]
[1] "slow" "moderate" "fast"
[[3]]
[1] "boring" "fun"
```

`myarray`

```
, , boring
slow moderate fast
tiny 1 8 15
small 2 9 16
medium 3 10 17
medium-ish 4 11 18
large 5 12 19
big 6 13 20
huge 7 14 21
, , fun
slow moderate fast
tiny 22 29 36
small 23 30 37
medium 24 31 38
medium-ish 25 32 39
large 26 33 40
big 27 34 41
huge 28 35 42
```
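
Named dimensions pay off when you pull pieces out of an array. A short sketch that rebuilds the same array so it runs on its own:

```r
myarray <- array(1:42, dim = c(7, 3, 2), dimnames = list(
  c("tiny", "small", "medium", "medium-ish", "large", "big", "huge"),
  c("slow", "moderate", "fast"),
  c("boring", "fun")))

myarray["tiny", "fast", "fun"]   # a single cell, by name
myarray[1, 3, 2]                 # the same cell, by position

fun_mat <- myarray[, , "fun"]    # one whole matrix from the stack
dim(fun_mat)

totals <- apply(myarray, 3, sum) # total of each matrix in the stack
totals
```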

- Lists are arbitrary collections of objects.
- The objects do not have to be of the same type or same element or same dimensions

```
mylist <- list(vec = myvec, mat = mymat, arr = myarray, date = mydate)
class(mylist)
```

`[1] "list"`

`length(mylist)`

`[1] 4`

`names(mylist)`

`[1] "vec" "mat" "arr" "date"`

`str(mylist)`

```
List of 4
$ vec : num [1:6] 5 3 5 6 1 2
$ mat : int [1:6, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "A" "B" "C" "D" ...
.. ..$ : chr [1:6] "G" "H" "I" "J" ...
$ arr : int [1:7, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:7] "tiny" "small" "medium" "medium-ish" ...
.. ..$ : chr [1:3] "slow" "moderate" "fast"
.. ..$ : chr [1:2] "boring" "fun"
$ date: Date[1:1], format: "2012-07-20"
```

- R has two object classification schemes, S3 and S4
- For S3 use `$` or `[[]]` to extract elements
- For S4 use `@` to extract elements

`mylist$vec`

`[1] 5 3 5 6 1 2`

`mylist[[2]][1, 3]`

`[1] 13`

- Where are we getting the object in the second row from?
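
The `@` operator has not appeared in an example yet, so here is a minimal sketch that puts the two extraction styles side by side. The `Student` class and its slots are hypothetical, defined only for this illustration.

```r
library(methods)  # normally loaded already; listed here to be explicit

# S3-style list: extract with $ or [[ ]]
s3_obj <- list(vec = c(5, 3, 5), label = "demo")
s3_obj$vec
s3_obj[["label"]]

# S4 class with two slots (hypothetical, for illustration only)
setClass("Student", representation(name = "character", readSS = "numeric"))
s4_obj <- new("Student", name = "A. Student", readSS = 495)

s4_obj@readSS          # extract a slot with @
slotNames("Student")   # list the slots a class defines
isS4(s4_obj)
```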

- Matrices, lists, and arrays are useful for storing analysis results, generating reports, and doing analysis on many object types
- We’ll see examples of list and array manipulation later
- A useful tip is to use the `attributes` function to learn about the object

`attributes(mylist)`

```
$names
[1] "vec" "mat" "arr" "date"
```

`attributes(myarray)[1:2][2]`

```
$dimnames
$dimnames[[1]]
[1] "tiny" "small" "medium" "medium-ish" "large"
[6] "big" "huge"
$dimnames[[2]]
[1] "slow" "moderate" "fast"
$dimnames[[3]]
[1] "boring" "fun"
```

- They also provide simplified ways to get used to operating on dataframes by reducing complexity

- Dataframes are combinations of vectors of the same length, but can be of different types

`str(df[, 25:32])`

```
'data.frame': 2700 obs. of 8 variables:
$ district : int 3 3 3 3 3 3 3 3 3 3 ...
$ schoolhigh: int 0 0 0 0 0 0 0 0 0 0 ...
$ schoolavg : int 1 1 1 1 1 1 1 1 1 1 ...
$ schoollow : int 0 0 0 0 0 0 0 0 0 0 ...
$ readSS : num 357 264 370 347 373 ...
$ mathSS : num 387 303 365 344 441 ...
$ proflvl : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
$ race : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...
```

- Data frames must have consistent dimensions
- Dataframes are what we use most commonly as a “dataset” for analysis
- Dataframes are what sets R apart from other programming languages like C, C++, Python, and Perl.
- The dataframe structure is much more complex and much easier to use than any datastructure in these languages–though Python is catching up!
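
Since `df` comes from the bootcamp data files, here is a sketch that builds a tiny data frame by hand. The values and the name `students` are made up, loosely modeled on the columns shown above.

```r
# Three vectors of the same length but different types
students <- data.frame(
  stuid   = c(101, 102, 103),                             # hypothetical IDs
  readSS  = c(357.2, 264.1, 370.5),                       # numeric scores
  proflvl = factor(c("basic", "below basic", "basic")),   # a factor
  stringsAsFactors = FALSE
)

str(students)
col_classes <- sapply(students, class)   # one class per column
col_classes

# Filter whole rows with a condition on one column
students[students$readSS > 300, ]
```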

- R has built in functions to allow you to force objects to move between types
- These follow the general form `as.whatIwant` as in `as.factor` or `as.table` or `as.data.frame`

- You will use these commands a lot to convert output from various functions into a form you can input into a different function
- A good example is converting correlation matrices into dataframes so we can plot them
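
One way this conversion might look, as a sketch. The seed and the names `wide` and `long` are illustrative; the "long" layout with one row per pair of variables is the shape most plotting functions expect.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
foo.mat <- matrix(c(rnorm(100), runif(100), runif(100), rpois(100, 2)), ncol = 4)
mycorr <- cor(foo.mat)

# Wide form: same layout as the matrix, one column per variable
wide <- as.data.frame(mycorr)
dim(wide)

# Long form: one row per pair of variables
long <- as.data.frame(as.table(mycorr))
names(long) <- c("row", "col", "correlation")
head(long)
```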

- Vectors are used to store bits of data
- Matrices are combinations of vectors of the same length and type
- Matrices are most commonly used in statistical models (in the background), and for computation
- Arrays are stacks of matrices and are used in building multiple models or for storing complex data structures
- Lists are groups of R objects commonly used to combine function output in useful ways (like store model results and model data together)

Create a matrix of 5x6 dimensions. Add a vector to it (as either a row or column). Identify element 2,3.

Convert the matrix to a data frame.

Look at the difference between data frames and matrices.

- There are a number of great R books available for learning the beginnings of R and for learning specific statistical techniques
- “Discovering Statistics Using R” by Andy Field
- “Applied Econometrics with R” by Achim Zeileis and Christian Kleiber
- “The Art of R Programming” by Norman Matloff
- The R Inferno
- The R Book by Michael J. Crawley (new version imminent)
- The R Cookbook

It is good to include the session info, e.g. this document is produced with **knitr** version `0.8`. Here is my session info:

`print(sessionInfo(), locale = FALSE)`

```
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 Cairo_1.5-2 mgcv_1.7-22 hexbin_1.26.0
[5] lattice_0.20-10 ggplot2_0.9.2.1 lmtest_0.9-30 zoo_1.7-9
[9] knitr_0.8 shiny_0.1.9 websockets_1.1.5 digest_0.5.2
[13] caTools_1.13 bitops_1.0-4.2
loaded via a namespace (and not attached):
[1] colorspace_1.2-0 dichromat_1.2-4 evaluate_0.4.2
[4] formatR_0.6 gtable_0.1.1 labeling_0.1
[7] markdown_0.5.3 MASS_7.3-22 Matrix_1.0-10
[10] memoise_0.1 munsell_0.4 nlme_3.1-105
[13] proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1
[16] RJSONIO_1.0-1 scales_0.2.2 stringr_0.6.1
[19] tools_2.15.1 xtable_1.7-0
```