Introduction

We will review the following statistical concepts here through the lens of R:

Probability
Statistical distributions
Attributes of distributions
Samples
Statistical models
Univariate Regression
Hypothesis testing
Multivariate Regression
Causality

What is statistics?

What do you think?
From Wikipedia:

“Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data. It deals with all aspects of this, including the planning of data collection”

Statistics gives us a way to quantify and express our uncertainty about the future or about the relationship of a sample to the population

Key Concepts to Statistical Thinking

Probability
How to describe data
Drawing inferences from data
Causation

Probability

Probability rests on the concept of randomness
Randomness is an abstract principle that describes the behavior of events in the world
Random phenomena have outcomes we cannot perfectly predict, but that have a regular distribution over many repetitions
Probability describes the proportion of times an event will occur over many repeated trials
The probability of the Packers winning the coin toss is 0.5
Let’s look at an example of coin tosses!
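The long-run idea above can be sketched with a short simulation. This is only an illustration: the seed, the number of flips, and the variable names are assumptions chosen for the example, not part of the original demo.

```r
# Simulate fair coin flips and track the proportion of heads over time.
set.seed(42)      # arbitrary seed so the sketch is reproducible
n_flips <- 10000  # assumed number of repetitions
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Running proportion of heads after each flip
running_prop <- cumsum(flips == "H") / seq_len(n_flips)

# Early proportions bounce around; later ones settle near 0.5
running_prop[c(10, 100, 1000, 10000)]
```

Any single flip remains unpredictable, but the proportion of heads over many flips is very regular.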

Probability’s uses

For truly random, independent events like coin flips, this is easy: over many repetitions the uncertainty of a single flip gives way to a predictable long-run proportion
But, what about events that depend on one another?

But…

What does it mean, then, when someone says “Kobe Bryant’s 3 has a 40% chance of going in”?
Or “Mason Crosby has a 65% chance of making that field goal”?
These aren’t random events, are they?
But we can model them as such with a large enough sample size

Interdependence and Independence

Coins are obviously independent in their flipping (though we can think of conditions where this is not true)
Other events have probabilities that vary based on other factors - these are described with conditional and joint probabilities
Let’s look at a simple example of this
Coin Flip Demo
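As a rough sketch of dependence, consider drawing two marbles from an urn without replacement. The urn contents, seed, and variable names are hypothetical and chosen only for illustration; this is not the original demo.

```r
# Two draws without replacement are NOT independent:
# the first draw changes what is left for the second.
set.seed(1)  # arbitrary seed for reproducibility
urn <- c("red", "red", "red", "blue", "blue")  # hypothetical urn

# Each column of 'draws' is one experiment: first and second marble
draws <- replicate(20000, sample(urn, 2))
first_red  <- draws[1, ] == "red"
second_red <- draws[2, ] == "red"

p_marg  <- mean(second_red)              # P(second red), exactly 3/5
p_cond  <- mean(second_red[first_red])   # P(second red | first red), exactly 2/4
p_joint <- mean(first_red & second_red)  # P(both red), exactly 3/5 * 2/4 = 0.3

round(c(marginal = p_marg, conditional = p_cond, joint = p_joint), 2)
```

Because the conditional probability differs from the marginal one, the two draws depend on one another, unlike successive coin flips.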

Describing Data

At its most basic level, statistics is about summarizing and understanding data
Data are themselves abstractions of real world concepts we care about
There are a number of useful ways to describe a set of data, some of which we have become familiar with so far

Spread of the Data

Data spread describes how scattered a data set is
One type of data, categorical data, describes groups

What can we learn here?

Let’s try another

What can we learn from this chart type about the data?

What else might we want to learn?

Still more

With diamonds we immediately want to look at price

What do you see?
Outliers?
Data modes?
Clusters?

Graphical Depictions of Data

These are ways to show data with graphics
Graphical displays are driven by the concept of dimensions
One dimension–a single category
Two dimensions–two categories

Levels of Measurement

Any given dimension may be measured at different levels of measurement
Nominal: unordered categories of data
Ordinal: ordered categories of data, relative size and degree of difference between categories is unknown
Interval: ordered categories of data, fixed width, like discrete temperature scales
Continuous (ratio): a measurement scale in a continuous space with a meaningful zero–physical measurements
This classification was derived by Stanley Smith Stevens in the 1940s and 50s

Quiz 1

Color (of diamonds) is what level of measurement?
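Nominal in general, though the diamonds data records color as an ordered grade (D best to J worst), so it can be treated as ordinal there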
Carats are what level of measurement?
Continuous

Levels of measurement matter

How you depict the data
What you can calculate using the data

Describing Data with Numbers

What types of measures can we use to describe different levels of measurement?

Level of measurement	Statistics
Nominal	mode, Chi-squared
Ordinal	median, percentile
Interval	mean, std. deviation, correlation, ANOVA
Continuous	geometric mean, harmonic mean, logarithms

Let’s talk about these statistics

STATISTIC: a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample, which are known together as a set of data. (Wikipedia: http://en.wikipedia.org/wiki/Statistic)
These statistics can measure a number of features of a dataset, but we tend to think of them as measuring either central tendency, spread, or association
We’ll focus on these today.

Measures of Central Tendency

These are the three canonical measures of central tendency:
Mean
Median
Mode
How are these different? What properties do they have? Why does this matter?

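library(ggplot2)  # provides the mpg dataset
library(xtable)
# Frequency table of highway mileage (hwy) in the mpg data
print(xtable(table(mpg$hwy)), type = "html")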

hwy	count
12	5
14	2
15	10
16	7
17	31
18	10
19	13
20	11
21	2
22	7
23	7
24	13
25	15
26	32
27	14
28	7
29	22
30	4
31	7
32	4
33	2
34	1
35	2
36	2
37	1
41	1
44	2

The Mean

Commonly thought of as the average
Add up all the values, divide by the number of observations
2 + 6 + 3 + 9 + 1 = 21
Divide by 5
4.2
The mean is sensitive to extreme values: a single very large or very small observation can pull it a long way, especially when there are few observations
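The arithmetic above can be checked in R. The extra value of 100 is an assumption added purely to illustrate sensitivity to an extreme observation.

```r
x <- c(2, 6, 3, 9, 1)

sum(x) / length(x)  # 21 / 5 = 4.2
mean(x)             # same result with the built-in function

# Add one extreme (hypothetical) observation and compare
x_out <- c(x, 100)
mean(x_out)         # 121 / 6, roughly 20.17
median(x_out)       # 4.5, barely affected by the extreme value
```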

The Median

This is a measure of the value the middle observation takes
Order the data: 1 2 3 6 9; get the length (5)
With an odd length, count to position (length + 1) / 2; in this case, the 3rd observation = 3
What if we add another number?
1 2 3 6 9 13
With an even length, take the observations at positions (length / 2) and (length / 2) + 1 and average them
3 + 6 = 9
9/2 = 4.5
The median depends only on the middle of the ordered data, so it is much less sensitive to extreme values than the mean
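The counting procedure above can be written as a small function. The name `manual_median` is hypothetical; in practice the built-in `median()` does the same job.

```r
# Median by hand: sort, then pick the middle value (odd length)
# or average the two middle values (even length).
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    mean(s[c(n / 2, n / 2 + 1)])
  }
}

manual_median(c(2, 6, 3, 9, 1))      # 3
manual_median(c(2, 6, 3, 9, 1, 13))  # (3 + 6) / 2 = 4.5
```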

The Mode

The most common observation
This can be problematic since skewed distributions can have a mode that diverges far from the mean or median
The mode is useful when looking for data errors (for example, a value that turns up suspiciously often) and is also used a lot in fitting statistical models
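Base R has no function for the statistical mode (the built-in `mode()` reports an object's storage type instead), so a small helper is one way to get it. The name `stat_mode` and the example vectors are assumptions for illustration.

```r
# Return the most common value(s); ties return every tied value.
stat_mode <- function(x) {
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(17, 26, 26, 29, 17, 26))  # 26
stat_mode(c("a", "b", "b", "a"))      # "a" "b" (a tie)
```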

Measures of spread

It is handy to describe the central tendency of the data, but we also need a sense of how closely the data cluster around that central tendency
Measures of spread help us achieve this

Quantiles

Quantiles are used to divide the data so that each “bin” holds an equal number of observations
An example is percentiles, where the data are divided evenly into 100 groups, or quartiles, where the data are divided into 4 groups
This is handy to look at the way the distribution of the data behaves
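A minimal sketch of quartiles in R, using simulated scores. The data, seed, and variable names are made up for the example.

```r
set.seed(7)  # arbitrary seed for reproducibility
scores <- rnorm(1000, mean = 70, sd = 10)  # hypothetical test scores

# The three cut points that split the data into four equal groups
quartiles <- quantile(scores, probs = c(0.25, 0.5, 0.75))
quartiles

# Assign each observation to its quartile bin and count the bins
bins <- cut(scores,
            breaks = quantile(scores, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
table(bins)  # each bin holds the same number of observations
```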

Standard Deviation and Variance

When we have multiple observations of data we are interested in how spread apart these values can be
The standard way to measure this is to use the variance and standard deviation (which is the square root of the variance)
Variance measures how far from the mean the data tend to be. It is calculated by subtracting the mean from each element of data, squaring each difference, adding the squares together, and dividing by the number of elements (or by the number of elements minus one when estimating from a sample, which is what R's var() does)
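The recipe above, step by step, reusing the five numbers from the mean example. The variable names are illustrative.

```r
x <- c(2, 6, 3, 9, 1)

deviations <- x - mean(x)  # distance of each value from the mean
sq_dev <- deviations^2     # squared so negatives do not cancel

pop_var  <- sum(sq_dev) / length(x)        # divide by n:     8.56
samp_var <- sum(sq_dev) / (length(x) - 1)  # divide by n - 1: 10.7

# R's built-ins use the n - 1 version
var(x)          # equals samp_var
sd(x)           # equals sqrt(samp_var)
```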

Skew

Skew describes the asymmetry of the distribution
Is it balanced? Are there clusters?
Skew is in reference to a norm - most often the normal distribution, but not always
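One common way to put a number on skew is the moment-based coefficient sketched below. Base R has no skewness function, so `skewness` here is a hypothetical helper, and other definitions exist.

```r
# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(3)  # arbitrary seed for reproducibility
skew_normal <- skewness(rnorm(5000))  # symmetric: close to 0
skew_exp    <- skewness(rexp(5000))   # long right tail: clearly positive

c(normal = skew_normal, exponential = skew_exp)
```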

Distributions of Data

All of the above are the concepts we use to describe a univariate group of data
Statisticians use the idea of a distribution to summarize how data generated through some process looks
Think about the difference between a coin flip and a test score
Both are generated by a stochastic process
Both may generate the same number of values
But, we have very different expectations about what values those numbers take in the real world
A distribution describes our expectation not just about the mean of the data (the average value), but also of how spread out it is likely to be
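The coin flip versus test score contrast can be sketched by simulating both. The choice of a normal distribution for test scores, its mean and spread, and the variable names are assumptions for illustration only.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Same number of values from two very different processes
coin_flips  <- rbinom(1000, size = 1, prob = 0.5)  # only 0 or 1
test_scores <- rnorm(1000, mean = 75, sd = 8)      # assumed bell-shaped scores

table(coin_flips)  # two possible values, roughly evenly split
round(c(mean = mean(test_scores), sd = sd(test_scores)), 1)
```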

Pictures of distributions

Let’s look at some distributions of data and talk about what they might represent
Different distributions describe different data generating processes, and these processes can be used to represent a large array of data types

Normal

The bell curve

library(ggplot2); library(eeptools)  # ggplot2 provides qplot(); eeptools provides theme_dpi()
qplot(rnorm(3000), geom = "density", adjust = 2) + theme_dpi()

Uniform

qplot(runif(1e+05, min = -5, max = 5), geom = "histogram") + theme_dpi()

Poisson

qplot(rpois(3000, lambda = 3), geom = "density", adjust = 2) + theme_dpi()

Binomial

qplot(rbinom(3000, 1, 0.5), geom = "bar") + theme_dpi()

Weibull

Skewness

qplot(rweibull(3000, shape = 18, scale = 1), geom = "density") + theme_dpi()

Distribution Demo

Drawing Distributions
These parameters define distributions (there are others as well), but you can see what a distribution does–it summarizes data

Sampling

Much of traditional statistics is based on the idea of samples
Sampling is the process of picking observations out of a population in a way that makes the smaller set of observations reflective of the population
How we sample determines what conclusions we can draw about the population from our data
The size of the sample relative to the population we are making inferences about determines our confidence / precision
Let’s look at a demo of some different types of sampling
Sampling Demo
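A rough sketch of why the sampling method matters, using a made-up population of two schools. All names, sizes, and score distributions are hypothetical, and this is not the original demo.

```r
set.seed(21)  # arbitrary seed for reproducibility

# Hypothetical population: school B is smaller and scores higher
population <- data.frame(
  school = rep(c("A", "B"), times = c(8000, 2000)),
  score  = c(rnorm(8000, mean = 70, sd = 10),
             rnorm(2000, mean = 85, sd = 10))
)
true_mean <- mean(population$score)

# Simple random sample: every student is equally likely to be picked
srs <- population[sample(nrow(population), 500), ]

# Convenience sample: only students from school B
conv <- population[sample(which(population$school == "B"), 500), ]

round(c(population = true_mean,
        random = mean(srs$score),
        convenience = mean(conv$score)), 1)
```

The random sample lands close to the population mean, while the convenience sample overstates it because it reflects only one school.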

Measures of association

Correlation is not causation
Correlation is a measure of the dependence between two variables and there are a number of different versions of measuring correlation
The most common is Pearson’s r which measures linear dependence on a scale of -1 to 1
Rank correlation is an option as well using Spearman’s rank correlation coefficient or Kendall’s tau rank correlation coefficient
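The three coefficients named above are all available through `cor()`. The simulated data below are an assumption chosen so that the relationship is strongly monotone but not linear.

```r
set.seed(5)  # arbitrary seed for reproducibility
x <- runif(200, min = 0, max = 3)
y <- exp(2 * x) + rnorm(200, sd = 0.5)  # increasing, but curved

r_pearson  <- cor(x, y, method = "pearson")   # linear dependence
r_spearman <- cor(x, y, method = "spearman")  # rank-based
r_kendall  <- cor(x, y, method = "kendall")   # rank-based

round(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall), 2)
```

Because the relationship is curved, Pearson's r understates how tightly y follows x, while the rank correlations sit closer to 1.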

Let’s look at Pearson’s Coefficient

Visualizing Correlations

We can identify other types of correlation though:

Regression and Statistical Models

Statistical models are mathematical representations of real world phenomena
We use statistical models to summarize and describe the relationships between pieces of data
Let’s look at the simplest version of a regression model and the basis of many statistical techniques: Ordinary Least Squares regression (OLS)
DEMO
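A minimal OLS sketch on simulated data, not the original demo. The true intercept (2) and slope (3), the seed, and the variable names are assumptions for the example.

```r
set.seed(9)  # arbitrary seed for reproducibility
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)  # assumed true line plus random noise

# Fit y = intercept + slope * x by ordinary least squares
fit <- lm(y ~ x)
coef(fit)  # estimates should land near 2 and 3

# The same estimates from the least squares normal equations
X <- cbind(1, x)  # column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
```

Calling `summary(fit)` adds standard errors and p-values for each coefficient.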

Hypothesis testing

What does it mean when someone says something is statistically significant?
“A significance test is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess.” ~ Moore and McCabe p. 436
Hypothesis testing is based on the idea of sampling and centers around the question of “Do we have enough data to believe the relationship we are seeing in our sample is true in the population?”
In a hypothesis test we specify two hypotheses–the null hypothesis and the alternative hypothesis
A test of significance then assesses the evidence against the null hypothesis in terms of probability

Most Common Applications

The most common application of these is when you compare a sample to a population or a true fixed value
A classic example is fluctuation in weight, such as an athlete's repeated weigh-ins compared against a target weight

wt <- c(190.5, 189, 195.5, 187, 191, 190.4, 186, 183, 193, 188)
t.test(wt, mu = 187, alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  wt 
## t = 2.067, df = 9, p-value = 0.06866
## alternative hypothesis: true mean is not equal to 187 
## 95 percent confidence interval:
##  186.8 191.9 
## sample estimates:
## mean of x 
##     189.3

Questions

## 
##  One Sample t-test
## 
## data:  wt 
## t = 2.067, df = 9, p-value = 0.06866
## alternative hypothesis: true mean is not equal to 187 
## 95 percent confidence interval:
##  186.8 191.9 
## sample estimates:
## mean of x 
##     189.3

The p-value is the probability, assuming the null hypothesis is true, of the test statistic taking a value as extreme or more extreme than what we observe
This tells us how “extreme” the value of interest, the average weight, is in relation to the comparison value, in this case 187
Traditionally, researchers set a fixed significance level for comparison before conducting the test, usually .05, sometimes .10 and sometimes .01
If our p-value is less than the pre-specified significance level, then we say the result is statistically significant
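The t statistic and p-value in the output above can be reproduced by hand. The variable names are illustrative; the data and the comparison value of 187 come from the example.

```r
wt  <- c(190.5, 189, 195.5, 187, 191, 190.4, 186, 183, 193, 188)
mu0 <- 187         # value under the null hypothesis
n   <- length(wt)

# How many standard errors the sample mean is from mu0
t_stat <- (mean(wt) - mu0) / (sd(wt) / sqrt(n))

# Two-sided p-value: probability of a t statistic at least this
# extreme in either direction, if the null hypothesis is true
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(t = t_stat, p = p_value), 4)  # about 2.067 and 0.0687
```

With a .05 significance level this result is not statistically significant; with .10 it is.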

Let’s look at an example

DEMO

Some practical advice about statistical significance

Statistical significance is very different from substantive significance. A difference can be statistically significant yet too small to matter in practice. For example, with enough precise measurements we could demonstrate a statistically significant difference in weight of 0.2 pounds, but in most cases this is substantively meaningless.
Statistical significance tells us little about the true value in the population except that it is unlikely to equal the value assumed under the null hypothesis. We are not testing what range of values our measurement could take in the real world, only whether it is plausibly different from another value or set of values.
Statistical significance is only valid under assumptions about the data. What might be a few ways that the data in our weight example could invalidate the usefulness of a significance test?
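The first point above can be sketched with simulated weigh-ins. The sample size, true mean of 187.2, and standard deviation are hypothetical values chosen to mirror the 0.2 pound example.

```r
set.seed(13)  # arbitrary seed for reproducibility

# 100,000 hypothetical measurements whose true mean is only
# 0.2 pounds above the comparison value of 187
many_wt <- rnorm(1e5, mean = 187.2, sd = 3.5)

res <- t.test(many_wt, mu = 187)
res$p.value          # extremely small: statistically significant
mean(many_wt) - 187  # yet the difference is only about 0.2 pounds
```

The test flags the difference as significant only because the sample is so large; whether 0.2 pounds matters is a substantive question the p-value cannot answer.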

Session Info

It is good to include the session info, e.g. this document is produced with knitr version 1.1. Here is my session info:

print(sessionInfo(), locale = FALSE)

## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xtable_1.7-1    mgcv_1.7-22     eeptools_0.2    ggplot2_0.9.3.1
## [5] knitr_1.1       Cairo_1.5-2     shiny_0.4.0    
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-5       caTools_1.14       colorspace_1.2-1  
##  [4] dichromat_2.0-0    digest_0.6.3       evaluate_0.4.3    
##  [7] formatR_0.7        grid_2.15.2        gtable_0.1.2      
## [10] labeling_0.1       lattice_0.20-15    MASS_7.3-23       
## [13] Matrix_1.0-11      munsell_0.4        nlme_3.1-108      
## [16] plyr_1.8           proto_0.3-10       RColorBrewer_1.0-5
## [19] reshape2_1.2.2     RJSONIO_1.0-2      scales_0.2.3      
## [22] stringr_0.6.2      tools_2.15.1       websockets_1.1.7

Basic Statistical Concepts