Lecture 11
Farmingdale State College
Correlation ≠ Causation
r
Positive correlation: variables move together
As one goes up, the other goes up.
As one goes down, the other goes down.
Assesses how two variables move in relation to one another.
What are measures of dispersion?
Measures of how spread out the data are, how much participants differ.
How do we measure dispersion when we were looking at one variable?
Standard deviation
Variance
Univariate = one variable
Bivariate = two variables
Multivariate = more than two variables
Univariate = one variable
Bivariate = two variables
Multivariate = more than two variables
Univariate measures of dispersion?
Bivariate measures of dispersion:
Covariance
How much variation in one variable is the same in another variable?
Do the variables vary in the same way?
Univariate measures of dispersion?
Standard deviation: Height M = 5'6"
, s = 2 inches
Bivariate measures of dispersion:
Professor Brocker wants to know if how many cups of coffee a professor drinks daily impacts their happiness. She asks
50
professors how many cups they have in a typical day and then has them rate their happiness on a scale of1 to 10
.
Variable 1: Number of Cups of Coffee = continuous
Variable 2: Happiness = continuous
We could:
Find the variance of each variable
Find the covariance of the two variables together
Covariance refers to how two continuous variables vary in tandem.
Covariance refers to how two variables vary together:
Generally, if cups of coffee goes up, what does happiness tend to do?
Generally, if happiness goes up, what does cups of coffee tend to do?
Professor Brocker wants to know if how many cups of coffee a professor drinks daily impacts their happiness. She asks
50
professors how many cups they have in a typical day and then has them rate their happiness on a scale of1 to 10
.
Variable 1: Number of Cups of Coffee
Variable 2: Happiness
\[ COV_{XY} = \frac{\Sigma(X-\bar{X}(Y-\bar{Y})}{N-1} \]
\[ COV_{XY} = \frac{\Sigma(X-\bar{X}(Y-\bar{Y})}{N-1} \]
\[ s^2 = \frac{\Sigma(X-\bar{X})^2}{N-1} \]
\[ COV_{XY} = \frac{\Sigma(X-\bar{X}(Y-\bar{Y})}{N-1} \]
\[ s^2 = \frac{\Sigma(X-\bar{X})^2}{N-1} = \frac{\Sigma(X-\bar{X})(X-\bar{X})}{N-1} \]
The variance of a variable is the covariance of that variable with itself.
\[ COV_{XY} = \frac{\Sigma(X-\bar{X}(Y-\bar{Y})}{N-1} \]
Covariance is the variance of two variables together… how they move together.
\[ COV_{XY} = \frac{\Sigma(X-\bar{X})(Y-\bar{Y})}{N-1} \]
Let’s break it down.
\[\bar{X} = \frac{\Sigma(x)}{n}\] \[\bar{Y} = \frac{\Sigma(Y)}{n}\]
Let’s break it down.
Find the mean of each variable.
Subtract the mean of x from each x value.
Subtract the mean of y from each y value.
\[(X-\bar{X})\] \[(Y-\bar{Y})\]
Let’s break it down.
Find the mean of each variable.
Subtract the mean of x from each x value.
Subtract the mean of y from each y value.
Multiple the deviation scores of x with the deviation scores of y.
\[(X-\bar{X})(Y-\bar{Y})\]
Let’s break it down.
Find the mean of each variable.
Subtract the mean of x from each x value.
Subtract the mean of y from each y value.
Multiple the deviation scores of x with the deviation scores of y.
Sum the products of the deviation scores.
\[\sum(X-\bar{X})(Y-\bar{Y})\]
Let’s break it down.
Find the mean of each variable.
Subtract the mean of x from each x value.
Subtract the mean of y from each y value.
Multiple the deviation scores of x with the deviation scores of y.
Sum the products of the deviation scores.
Divide by N-1.
\[\frac{\sum(X-\bar{X})(Y-\bar{Y})}{N-1}\]
Covariance is the bivariate equivalent of variance.
What does the covariance tell us?
What did the variance tell us?
For it to be useful, we had to standardize it.
\[ COV_{XY} = \frac{\Sigma(X-\bar{X})(Y-\bar{Y})}{N-1} \]
Correlation is the standardized covariance.
Correlation is the bivariate version of the standard deviation.
To calculate the correlation, we divide the covariance by the standard deviation of x multipled by the standard deviation of y.
\[\frac{cov_{xy}}{(SD_x)(SD_y)}\]
\[ COV_{XY} = \frac{\Sigma(X-\bar{X})(Y-\bar{Y})}{N-1} \] \[\frac{cov_{xy}}{(SD_x)(SD_y)}\]
\[ \frac{\Sigma(X-\bar{X})(Y-\bar{Y})}{(SD_x)(SD_y)(N-1)} \]
Correlation is the bivariate version of the standard deviation.
\[ \frac{\Sigma(X-\bar{X})(Y-\bar{Y})}{(SD_x)(SD_y)(N-1)} \]
Correlation is bounded between -1 and +1.
Closer to -1 is a negative correlation.
Closer to +1 is a positive correlation.
Closer to 0 means no correlation.
This is the Pearson Product-Moment Correlation (Pearson, 1895).
The sign indicates the direction.
The number indicates the magnitude/strength of the relationship.
0.1 is a small correlation
0.3 is a moderate correlation
0.5 or higher is a large correlation
Correlation is written as r
If we square r, we get \(R^2\)
\(R^2\) is always positive
\(R^2\) refers to:
The amount of x accounted for by y.
The amount of y accounted for by x.
Students grades in MTH 110 and their grades in correlation at r = 0.3
Moderate correlation
How much of your grade is accounted for by your MTH 110 grade?
0.3 x 0.3 = 0.09
\(R^2\) = 9% of your grade is explained by your MTH 110 grade.
We don’t use the words caused/resulting from.
When reporting r: “As x goes up, y tends to go up.”
When reporting \(R^2\): “z% of x is accounted for by y.”
In the Method, we report descriptive statistics about our variables (M, s).
It is common to refer readers to a table where they can look at correlations between the variables.
Why might this be important?
Table 1 in a paper is typically a combination of descriptive stats AND correlation matrix, like this:
M | s | 1 | 2 | 3 | |
---|---|---|---|---|---|
Age | 20.9 | 1.76 | 1 | -- | -- |
Song 1 | 4.21 | 1.2 | .03 | 1 | -- |
Song 2 | 6.42 | 2.18 | .65** | -.34* | 1 |
Table 1 in a paper is typically a combination of descriptive stats AND correlation matrix, like this:
M | s | 1 | 2 | 3 | |
---|---|---|---|---|---|
Age | 20.9 | 1.76 | 1 | -.03 | .65** |
Song 1 | 4.21 | 1.2 | -.03 | 1 | -.34* |
Song 2 | 6.42 | 2.18 | .65** | -.34* | 1 |
In a psychology experiment, researchers collected data on the number of hours spent studying and the scores on a final exam for
50
students. Below is a scatterplot showing the relationship between the two variables.
Using the same data from the previous question, the researchers fit a linear regression model to predict exam scores from the number of hours spent studying. The \(R^2\) value for the model is
0.72
.
What does an \(R^2\) value of 0.72
tell you about how well hours spent studying explain the variance in exam scores? What percentage of the variance is explained by the model, and what percentage remains unexplained?
Imagine you are the researcher. Based on the \(R^2\) value of 0.72
, would you consider study time to be a good predictor of exam performance? Why or why not?
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡