Basic Statistical Concepts
R Bootcamp HTML Slides
Jared Knowles
Introduction
We will review the following statistical concepts here through the lens of R:
- Probability
- Statistical distributions
- Attributes of distributions
- Samples
- Statistical models
- Univariate Regression
- Hypothesis testing
- Multivariate Regression
- Causality
What is statistics?
“Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data. It deals with all aspects of this, including the planning of data collection”
- Statistics gives us a way to quantify and express our uncertainty about the future or about the relationship of a sample to the population
Key Concepts to Statistical Thinking
- Probability
- How to describe data
- Drawing inferences from data
- Causation
Probability
- Probability rests on the concepts of randomness
- Randomness is an abstract principle that describes the behavior of events in the world
- Random phenomena have outcomes we cannot perfectly predict, but that have a regular distribution over many repetitions
- Probability describes the the proportion of times an event will occur in many repeated trials of observation
- The probability of the Packers winning the coin toss is 0.5
- Let’s look at an example of coin tosses!
Probability’s uses
- For truly random independent events like coin flips, this is easy and uncertainty can become certainty
- But, what about events that depend on one another?
But…
- What does it mean then when someone says “Kobe Bryant’s 3 has a 40% chance of going in.”
- Or “Mason Crosby has a 65% chance of making that field goal?”
- These aren’t random events, are they?
- But we can model them as such with large enough sample size
Interdependence and Independence
- Coins are obviously independent in their flipping (though we can think of conditions where this is not true)
- Other events have probability that varies based on other factors - joint probabilities
- Let’s look at a simple example of this
- Coin Flip Demo
Describing Data
- At it’s most basic level, statistics is about summarizing and understanding data
- Data are themselves abstractions of real world concepts we care about
- There are a number of useful ways to describe a set of data, some of which we have become familiar with so far
Spread of the Data
- Data spread describes how scattered a data set is
- One type of data, categorical data, describes groups
Let’s try another
- What can we learn from this chart type about the data?
- What else might we want to learn?
How about?
Still more
- With diamonds we immediately want to look at price
- What do you see?
- Outliers?
- Data modes?
- Clusters?
Graphical Depictions of Data
- These are ways to show data with graphics
- Graphical displays are driven by the concept of dimensions
- One dimension–a single category
- Two dimensions–two categories
Levels of Measurement
- Any given dimension may be measured at different levels of measure
- Nominal: unordered categories of data
- Ordinal: ordered categories of data, relative size and degree of difference between categories is unknown
- Interval: ordered categories of data, fixed width, like discrete temperature scales
- Continuous (ratio): a measurement scale in a continuous space with a meaningful zero–physical measurements
- This classification was derived by Stanley Smith Stevens in the 1940s and 50s
Quiz 1
- Color (of diamonds) is what level of measurement?
- Nominal
- Carats are what level of measurement?
- Continuous
Levels of measurement matter
- How you depict the data
- What you can calculate using the data
Describing Data with Numbers
- What types of measures can we use to describe different levels of measurement?
Nominal |
mode, Chi-squared |
Ordinal |
median, percentile |
Interval |
mean, std. deviation, correlation, ANOVA |
Continuous |
geometric mean, harmonic mean, logarithms |
Let’s talk about these statistics
- STATISTIC: a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample which are known together as a set of data. (Wikipedia)[http://en.wikipedia.org/wiki/Statistic]
- These statistics can measure a number of features of a dataset, but we tend to think of them as measuring either central tendency, spread, or association
- We’ll focus on these today.
Measures of Central Tendency
- These are the three canonical measures of central tendency:
- Mean
- Median
- Mode
- How are these different? What properties do they have? Why does this matter?
library(xtable)
print(xtable(table(mpg$hwy)), type = "html")
|
V1
|
12
|
5
|
14
|
2
|
15
|
10
|
16
|
7
|
17
|
31
|
18
|
10
|
19
|
13
|
20
|
11
|
21
|
2
|
22
|
7
|
23
|
7
|
24
|
13
|
25
|
15
|
26
|
32
|
27
|
14
|
28
|
7
|
29
|
22
|
30
|
4
|
31
|
7
|
32
|
4
|
33
|
2
|
34
|
1
|
35
|
2
|
36
|
2
|
37
|
1
|
41
|
1
|
44
|
2
|
The Mean
- Commonly thought of as the average
- Add up all the values, divide by the number of observations
- 2 + 6 + 3 + 9 + 1 = 21
- Divide by 5
- 4.2
- The mean is sensitive to the number of observations and the spread of the data
The Mode
- The most common observation
- This can be problematic since skewed distributions can have a mode that diverges far from the mean or median
- The mode is useful in cases where you are looking for data errors (the most common value) and is also used a lot in fitting statistical models
Measures of spread
- It is handy to describe the central tendency of the data, but we also need a sense of how much the data resembles the central tendency
- Measures of spread help us achieve this
Quantiles
- Quantiles are used to divide the data into an even number of observations per “bin”
- An example is percentiles where data is divided evently into 100 groups, or quartiles where the data is divided into 4 groups
- This is handy to look at the way the distribution of the data behaves
Standard Deviation and Variance
- When we have multiple observations of data we are interested in how spread apart these values can be
- The standard way to measure this is to use the variance and standard deviation (which is the square root of the variance)
- Variance measures how far away from the mean the data can be and can be easily calculated by subtracting each element of data from the mean, squaring that difference, adding them together, and dividing by the number of elements
Skew
- Skew describes the symmetry of the distribution
- Is it balanced? Are there clusters?
- Skew is in reference to a norm - most often the normal distribution, but not always
Distributions of Data
- All of the above are the concepts we use to describe a univariate group of data
- Statisticians use the idea of a distribution to summarize how data generated through some process looks
- Think about the difference between a coin flip and a test score
- Both are generated by a stochastic process
- Both may generate the same amount of numbers
- But, we have very different expectations about what values those numbers take in the real world
- A distribution describes our expectation not just about the mean of the data (the average value), but also of how spread out it is likely to be
Pictures of distributions
- Let’s look at some distributions of data and talk about what they might represent
- Different distributions describe different data generating processes, and these processes can be used to represent a large array of data types
Normal
qplot(rnorm(3000), geom = "density", adjust = 2) + theme_dpi()
Poisson
qplot(rpois(3000, lambda = 3), geom = "density", adjust = 2) + theme_dpi()
Binomial
qplot(rbinom(3000, 1, 0.5), geom = "bar") + theme_dpi()
Weibull
qplot(rweibull(3000, shape = 18, scale = 1), geom = "density") + theme_dpi()
Distribution Demo
- Drawing Distributions
- These parameters define distributions (there are others as well), but you can see what a distribution does–it summarizes data
Sampling
- Much of traditional statistics is based on the idea of samples
- Sampling is the process of picking observations out of a population in a way that makes the smaller set of observations reflective of the population
- How we sample determines what conclusions we can draw in the population from our data
- The size of the sample relative to the population we are making inference about determines our confidence / precision
- Let’s look at a demo at some different types of sampling-
- Sampling Demo
Let’s look at Pearson’s Coefficient
We can identify other types of correlation though:
Regression and Statistical Models
- Statistical models are mathematical representations of real world phenomena
- We use statistical models to summarize and describe the relationships between pieces of data
- Let’s look at the simplest version of regression model and the basis of many statistical techniques: Ordinary Least Squares regression (OLS)
- DEMO
Hypothesis testing
- What does it mean when someone says something is statistically significant?
- “A significance is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess.”~ Moore and McCabe p. 436
- Hypothesis testing is based on the idea of sampling and centers around the question of “Do we have enough data to believe the relationship we are seeing in our sample is true in the population?”
- In a hypothesis test we specify two hypotheses–the null hypothesis and the alternative hypothesis
- A test of significance then assesses the evidence against the null hypothesis in terms of probability
Most Common Applications
- The most common application of these is when you compare a sample to a population or a true fixed value
- A classic example is fluctuation in weight, for an athlete for example
wt <- c(190.5, 189, 195.5, 187, 191, 190.4, 186, 183, 193, 188)
t.test(wt, mu = 187, alternative = "two.sided")
##
## One Sample t-test
##
## data: wt
## t = 2.067, df = 9, p-value = 0.06866
## alternative hypothesis: true mean is not equal to 187
## 95 percent confidence interval:
## 186.8 191.9
## sample estimates:
## mean of x
## 189.3
Questions
##
## One Sample t-test
##
## data: wt
## t = 2.067, df = 9, p-value = 0.06866
## alternative hypothesis: true mean is not equal to 187
## 95 percent confidence interval:
## 186.8 191.9
## sample estimates:
## mean of x
## 189.3
- the P value reflects our belief about the probability of the test statistic taking a value as extreme or more extreme than what we observe
- This translates into our belief about “extremeness” of the value of interest, the average weight, in relation to the comparison group, in this case 187
- Traditionally, researchers set a fixed value of p for comparison before conducting the test, usually .05, sometimes .10 and sometimes .01
- If our p is less than the pre-specified p value we set, then we say the observed value is statistically significant
Let’s look at an example
DEMO
Some practical advice about statistical signficance
- Statistical significance is very different from substantive significance. Variables can be statistically significant but have no observable difference. For example, with enough precise measurements we could demonstrate a statistically significant difference in weight of 0.2 pounds, but in most cases this is substantively meaningless.
- Statistical significance tells us nothing about the true value in the population except that it is not equal to the assumption of the null hypothesis. We are not testing what range of values our measurement could take in the real world, only if it is plausibly different from another set of values.
- Statistical significance is only valid under assumptions about the data. What might be a few ways that the data in our weight example could invalidate the usefulness of a significance test?
Session Info
It is good to include the session info, e.g. this document is produced with knitr version 1.1. Here is my session info:
print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.7-1 mgcv_1.7-22 eeptools_0.2 ggplot2_0.9.3.1
## [5] knitr_1.1 Cairo_1.5-2 shiny_0.4.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-5 caTools_1.14 colorspace_1.2-1
## [4] dichromat_2.0-0 digest_0.6.3 evaluate_0.4.3
## [7] formatR_0.7 grid_2.15.2 gtable_0.1.2
## [10] labeling_0.1 lattice_0.20-15 MASS_7.3-23
## [13] Matrix_1.0-11 munsell_0.4 nlme_3.1-108
## [16] plyr_1.8 proto_0.3-10 RColorBrewer_1.0-5
## [19] reshape2_1.2.2 RJSONIO_1.0-2 scales_0.2.3
## [22] stringr_0.6.2 tools_2.15.1 websockets_1.1.7