Now, let X be the number on top of the die, and let Y be the number on the bottom of the die. (It is traditional to use X and Y for the two variables you are comparing, so you will have to get used to X and Y meaning different quantities in different examples.) It happens that dice are constructed so that the numbers on opposite faces add up to 7, so Y = 7 - X: once you know X, you can predict Y perfectly.
Now toss three dice. Let X be the sum of the first two dice; let Y be the sum of the second and third dice. You would expect there to be some relationship between these two cases, since they both depend on the value that appears on the middle die. However, the first and third dice are independent of each other, so knowing X will not allow you to make perfect predictions of Y. That is often the way of the world: there is some relationship, but not a perfect relationship.
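As an illustration (a minimal simulation sketch, not part of the original text), here is a short Python program that tosses the three dice many times and estimates how strong the relationship between X and Y is:

    import random

    # Simulate many tosses of three dice.
    # X = sum of the first two dice; Y = sum of the second and third dice.
    trials = 10000
    xs, ys = [], []
    for _ in range(trials):
        d1, d2, d3 = (random.randint(1, 6) for _ in range(3))
        xs.append(d1 + d2)
        ys.append(d2 + d3)

    # Sample correlation between X and Y, computed by hand.
    mean_x = sum(xs) / trials
    mean_y = sum(ys) / trials
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / trials
    sd_x = (sum((x - mean_x) ** 2 for x in xs) / trials) ** 0.5
    sd_y = (sum((y - mean_y) ** 2 for y in ys) / trials) ** 0.5
    print("estimated correlation:", round(cov / (sd_x * sd_y), 3))

The correlation comes out around 0.5: the shared middle die creates a relationship, but the independent first and third dice keep it from being perfect.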
The branch of statistical analysis involved with fitting lines that represent a pattern of dots is called regression analysis.
In order to write a mathematical equation that describes a line, we need to know two numbers: the slope and the Y intercept. (The Y intercept is also called the constant term.)
If we let a represent the Y intercept and b represent the slope, then the equation can be written:
Y = a + bX
Our mission is to find the values of a and b associated with the line that best fits the data. If there are only two points, then it is always possible to find a line that fits perfectly; if there are more than two points, a perfect fit is unlikely. No matter how hard you try, there will usually be some deviation between the points and the line you are using to represent them. We want to choose the line that minimizes these deviations (called errors). The normal procedure is to square the errors, add them up, and choose the line that minimizes the result.
This sounds like a lot of work, but fortunately we can turn it over to the computer. The general procedure for regression analysis is to collect your observations of X and Y, feed them into the computer, and let it report the constant and slope of the line that best fits the data, along with a measure of how well that line fits.
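To give a sense of how this looks in practice (a sketch with invented numbers, not the author's own procedure), Python's numpy library can find the least-squares line directly:

    import numpy as np

    # Hypothetical sample data: paired observations of x and y.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # polyfit with degree 1 minimizes the sum of squared errors and
    # returns the slope b and intercept a of the best-fit line.
    b, a = np.polyfit(x, y, 1)

    print("intercept a:", round(a, 3))
    print("slope b:", round(b, 3))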
We need more than just a description of the best line: we need a way to measure whether it is very good. (The best possible line still might not be very good.) The computer will report a value known as the r-squared value to indicate how well the relationship fits. These are the properties of the r-squared value:
[Scatter graph: r-squared = 1; positive slope to regression line]
[Scatter graph: r-squared = 1; negative slope to regression line]
[Scatter graph: r-squared = 0]

Suppose a contest is to be held between Rosencrantz and Guildenstern. Both will be trying to guess the value of a variable Y. Rosencrantz has no information except for the average value of Y. Guildenstern, on the other hand, knows in advance the value of X, and he knows the regression equation connecting Y and X. The question is: will Guildenstern do better than Rosencrantz at guessing Y? If the r-squared value is 0, then knowing X provides no help at forecasting the value of Y. If the r-squared value is 1, then knowing X allows you to make a perfect forecast for the value of Y (assuming that the situation has not changed between the time you collected your data and the time you're making your forecast).
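To make the contest concrete (an illustrative sketch with made-up numbers, not from the original text), r-squared can be computed as one minus the ratio of Guildenstern's squared errors to Rosencrantz's:

    import numpy as np

    # Hypothetical data for the contest.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.3, 9.8, 12.2])

    b, a = np.polyfit(x, y, 1)           # Guildenstern's regression line
    guild_errors = y - (a + b * x)       # errors using the regression prediction
    rosen_errors = y - y.mean()          # errors using only the average of Y

    r_squared = 1 - (guild_errors ** 2).sum() / (rosen_errors ** 2).sum()
    print("r-squared:", round(r_squared, 4))   # close to 1: X is a big help here

An r-squared of 0 means Guildenstern's squared errors are just as large as Rosencrantz's; an r-squared of 1 means his errors are zero.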
Click here for some examples of scatter graphs.
So far we have considered examples where we know the process generating the variables (dice), or where we know data for the complete population (the states). More commonly, we only have data from a sample, and we will use our analysis of the relationship visible in that sample to predict a relationship that would be visible in the population. Sometimes our sample will be limited because of time. We have no way of observing data from the future, so we are limited to using observations from the present and past if we want to discover a relationship between the variables. If the future relationship will stay the same as it was in the past (except for the same level of random variation), then we can use our past observations to predict the future relationship. Other times we may be looking at data from a sample of people (or some other type of object) to try to predict properties of the entire population.
Our assumption in regression analysis is that there exists a relationship
between two variables X and Y that can be described by this equation:
Y = a + bX + e
where a and b are unknown parameters. X is called the
independent variable; Y is the dependent variable.
The assumption is that variations in X are responsible for causing
the variation in Y. However, you must be careful with this
assumption, since the mere fact you have established a relationship
between X and Y does not mean that X causes the changes in Y.
It might be that Y causes the changes in X, or it could be that
there is a third unidentified factor that causes changes in both
X and Y. (For example, any two variables that both grow with time will
appear to have a relationship, even if they are totally independent.)
Or you might even have bad luck with your sample. Your sample might
indicate there is a relationship between X and Y when in fact there
is no such relationship in the population. The chance of that happening
is small if the sample is large enough, but if you make a career
of performing regression analysis that kind of bad luck is bound to
happen occasionally.
e is a random variable called the error term. If X were the only variable that affected Y, then no error term would be needed, and we could find a perfect fit to our regression line. However, there almost always will be other factors affecting Y that we don't observe. If we don't have any information about these, we have to assume that their effect can be described as random chance. The assumption is that e is a random variable with a normal distribution, zero mean, and unknown variance sigma-squared. The assumption of zero mean is not restrictive. (If by some chance e had a nonzero mean, then that mean could be absorbed into the constant a, redefining e as a new random variable with zero mean.) The assumption that the distribution of e is normal can be tested by looking at the residuals. It is unfortunate that the true value of sigma-squared is unknown, but, as is typical in statistics, we will try to estimate it. If this regression is to be much good in explaining the relationship between X and Y, then sigma-squared needs to be relatively small. Another way of saying it: if the random variable e contributes a large part of the variance of Y, then there are other factors influencing Y in addition to X, and you somehow should track those factors down and include them in your analysis. ("What if we find there is more than one quantity that affects Y?" you might be wondering. Look ahead to the section on multiple regression.)
After you have collected the observations for your sample and fed them into the computer, the computer will return the results of the regression calculation: the slope and constant of the regression line. However, these values are not necessarily the same as the true values a and b. We would know those true values only if we could observe the entire population. Instead, we use the regression coefficients from the sample as estimators for unknown parameters. That means we can perform hypothesis tests on these estimators.
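To see the distinction between the true parameters and the estimates from a sample, here is a small simulation sketch (the values a = 5, b = 2, and sigma = 3 are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # True (normally unobservable) parameters of the population relationship.
    a_true, b_true, sigma = 5.0, 2.0, 3.0

    # Draw one sample: 30 observations of X, then Y = a + bX + e with normal errors.
    x = rng.uniform(0, 10, size=30)
    e = rng.normal(0, sigma, size=30)
    y = a_true + b_true * x + e

    # The regression calculation returns estimates, not the true values.
    b_hat, a_hat = np.polyfit(x, y, 1)
    print("true a, b:     ", a_true, b_true)
    print("estimated a, b:", round(a_hat, 3), round(b_hat, 3))  # close, but not exact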
Simple regression analysis, as we have discussed so far, includes only one independent variable and one dependent variable. Sometimes, though, one independent variable simply isn't enough. For example, suppose that the price of a house (Y) in a particular city depends on four variables: square feet in house (X1); distance to business center (X2); distance to nearest school (X3); and the interest rate (X4). Assume that the dependent variable Y is related to the four independent variables according to an equation of this form:
Y = B0 + B1 X1 + B2 X2 + B3 X3 + B4 X4 + e
B0 is the constant term, analogous to the Y-intercept term in simple regression. B1, B2, B3, and B4 are called the coefficients. The true values of all of the B's are unknown; we will estimate them based on our regression calculation. e again represents an error term; assume that it has a normal distribution with mean 0 and unknown variance sigma-squared. e represents all factors affecting house prices other than the four we have included. The value of sigma-squared should be relatively small; otherwise, there are important factors influencing house prices that we have not included, and our regression equation will not be able to do a very good job of predicting house prices.
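As a sketch of how such an equation can be estimated (the numbers below are invented purely for illustration, not real housing data), the coefficients can be found by least squares just as in the simple case:

    import numpy as np

    # Invented observations: square feet, miles to business center,
    # miles to nearest school, interest rate, and sale price (thousands).
    X = np.array([
        [1500, 5.0, 0.5, 6.0],
        [2200, 3.0, 1.0, 6.5],
        [1800, 8.0, 0.3, 5.5],
        [2600, 2.0, 2.0, 7.0],
        [2000, 6.0, 1.5, 6.0],
        [1700, 4.0, 0.8, 5.0],
    ])
    y = np.array([210, 310, 240, 350, 275, 245])

    # Add a column of ones so the calculation also estimates the constant B0.
    design = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    print("estimated B0..B4:", np.round(coeffs, 3))

With only six invented observations the estimates are not reliable; the point is simply that the computer returns one estimated coefficient per variable, plus the constant.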
There are two important differences between simple regression and multiple regression:
The computer will also report a standard error for each coefficient in the multiple regression. A larger value for the standard error means that there is more uncertainty about the true value of that coefficient. Dividing the estimated coefficient by the corresponding standard error gives a quantity known as the t statistic, which is used for hypothesis tests about whether or not a particular variable belongs in the regression equation. If the true value of that coefficient is zero, then the t statistic will come from a t distribution with n-m degrees of freedom. (n is the number of observations; m is the number of coefficients that are estimated, including the constant term.) If the absolute value of the reported t statistic is greater than the absolute value of the critical value from the t distribution table, then reject the null hypothesis: the true coefficient is not zero, and the variable belongs.
For any reasonable value of the degrees of freedom, the value from the t distribution table will be close to 2. Therefore, the approximate rule is: if the absolute value of the t statistic is greater than 2, conclude that the variable belongs in the equation; if it is less than 2, do not reject the hypothesis that the variable's true coefficient is zero.
The computer will also report an F statistic, which is used to test the hypothesis that the coefficients of all of the independent variables are zero. If the null hypothesis is true, and the coefficients are all zero (meaning the regression calculation is worthless for predicting the value of Y), then the F statistic will come from an F distribution with m-1 numerator degrees of freedom and n-m denominator degrees of freedom. If the reported value is greater than the critical value for those degrees of freedom, then reject the null hypothesis.
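For readers who want to look up the critical values without a printed table, scipy can provide them (a sketch; the 5 percent significance level and the n = 8, m = 3 numbers below are only an example):

    from scipy import stats

    n, m = 8, 3            # observations and estimated coefficients (including constant)
    alpha = 0.05           # 5 percent significance level

    # Two-tailed critical value for a coefficient's t statistic.
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - m)

    # Critical value for the regression's F statistic.
    f_crit = stats.f.ppf(1 - alpha, dfn=m - 1, dfd=n - m)

    print("t critical value:", round(t_crit, 3))   # about 2.571 for 5 degrees of freedom
    print("F critical value:", round(f_crit, 3))   # about 5.79 for (2, 5) degrees of freedom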
In practice the difficult matter with regression analysis is determining
which variables to include, and the exact form to use for the equation.
You can test to see whether a variable that is included really
belongs; however, you might have left out variables that should
be included. The restriction that the equation be linear is not
a big problem; if the true relationship involves a quadratic curve, such as
Y = a X1 + b X1^2
then simply include both X1 and
X1-squared as independent variables in your regression analysis.
If the true relation is of the form
Y = X1^B1 * X2^B2
take the logarithm of each side to convert it to a linear form:
log Y = B1 log X1 + B2 log X2
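As an illustration of these transformations (a sketch with made-up data; numpy is assumed), adding a squared term or taking logs reduces the nonlinear cases to ordinary linear regression:

    import numpy as np

    rng = np.random.default_rng(1)

    # Quadratic case: generate data from Y = 2*X1 + 0.5*X1^2 plus a little noise.
    x1 = rng.uniform(1, 10, size=50)
    y = 2 * x1 + 0.5 * x1**2 + rng.normal(0, 1, size=50)

    # Include both X1 and X1^2 as independent variables (plus a constant column).
    design = np.column_stack([np.ones_like(x1), x1, x1**2])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    print("constant, X1, X1^2 coefficients:", np.round(coeffs, 2))

    # Multiplicative case: Y = X1^1.5 * X2^0.7 becomes linear after taking logs.
    x2 = rng.uniform(1, 10, size=50)
    y2 = x1**1.5 * x2**0.7
    design_log = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
    b_log, *_ = np.linalg.lstsq(design_log, np.log(y2), rcond=None)
    print("log-form coefficients (B1, B2):", np.round(b_log[1:], 2))  # about 1.5 and 0.7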
Some other problems that can arise with multiple regression include multicollinearity (when two or more of the independent variables are highly correlated); heteroscedasticity (when the variances of the error terms are different for different observations); and serial correlation (when the errors for successive observations of time series data are correlated with each other). These problems make it more difficult for the regression calculation to estimate the coefficients accurately.
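One rough screen for multicollinearity (a sketch; it only catches pairwise correlation, not more subtle linear dependence among three or more variables) is to look at the correlation matrix of the independent variables:

    import numpy as np

    # Hypothetical observations of three independent variables, one per column.
    X = np.array([
        [2.0,  4.1, 7.0],
        [3.0,  6.2, 5.5],
        [4.0,  7.9, 6.1],
        [5.0, 10.1, 8.0],
        [6.0, 11.8, 7.2],
    ])

    # Correlation matrix of the columns; entries near +1 or -1 flag pairs of
    # independent variables that move almost in lockstep (multicollinearity).
    corr = np.corrcoef(X, rowvar=False)
    print(np.round(corr, 2))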
Here is some sample data. However, in reality it would not be very reliable to perform a regression calculation when the number of observations is this small.
X1  X2  X3  X4    Y
 2   1   3   0   12
 3   3   8   0   23
 2   5   3   1   29
 2   5   7  -1   27
 5   1   3  -1   20
 5   1   8   0   21
 4   6   4   1   39
 5   5   8   0   37
The value of Y is given by the equation:
Y = 2 + 3 X1 + 4 X2 + X4
However, remember that in reality we will not be able to see the true equation as we can in this artificial example. If we knew that X1, X2, and X4 all should be included, then we would run the regression with those independent variables and we would find an r-squared value of 1, with each of the true coefficients found exactly. However, in reality we don't know for sure which variables should be included. Suppose we perform a multiple regression calculation with X1, X2, X3, and X4 as the independent variables. The results are:
SUMMARY OUTPUT

Regression Statistics
  Multiple R          1
  R Square            1
  Adjusted R Square   1
  Standard Error      1.34894E-14
  Observations        8

ANOVA
              df   SS            MS            F             Significance F
  Regression   4   566           141.5         7.77627E+29   2.36796E-45
  Residual     3   5.45892E-28   1.81964E-28
  Total        7   566

              Coefficients    Standard Error   t Stat         P-value
  Intercept    2              1.86004E-14      1.07525E+14    1.77398E-42
  X1           3              3.89823E-15      7.6958E+14     4.83848E-45
  X2           4              2.82546E-15      1.4157E+15     7.77249E-46
  X3          -2.75345E-16    2.31871E-15     -0.118749562    0.912978976
  X4           1              7.91699E-15      1.26311E+14    1.09434E-42

There is a perfect fit. The t statistics for all variables except X3 are huge, indicating they all belong. Also, the regression correctly indicates that X3 does not belong, because its t statistic (-0.1187) is between -2 and 2. Suppose we perform a multiple regression calculation with X1 and X2 as the independent variables. The results are:
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.997160976
  R Square            0.994330011
  Adjusted R Square   0.992062016
  Standard Error      0.801150874
  Observations        8

ANOVA
              df   SS            MS            F             Significance F
  Regression   2   562.7907864   281.3953932   438.4179848   2.42078E-06
  Residual     5   3.209213615   0.641842723
  Total        7   566

              Coefficients    Standard Error   t Stat         P-value
  Intercept    1.558098592    1.033917083      1.506986021    0.192171443
  X1           2.977992958    0.219146535      13.5890488     3.86799E-05
  X2           4.153755869    0.145235749      28.60009257    9.79027E-07

The r-squared value is 0.9943. We do not have a perfect fit, because we left out the variable X4. However, X4 has only a very small influence on Y, so leaving it out has not hurt our regression equation noticeably. The estimated coefficients (2.978 and 4.1538) are close to the true values (3 and 4, respectively). The t statistics need to be compared against a t distribution with 8-3=5 degrees of freedom, which gives a critical value of 2.571 using the 5 percent significance level. The two t statistics (13.589 and 28.6) are way above the critical value, so we can clearly reject the hypothesis that the true coefficients are zero.
The F statistic for this regression is reported to be 438.4; this needs to be compared against an F distribution with 3-1=2 numerator degrees of freedom and 8-3=5 denominator degrees of freedom. The critical value is 5.79. The observed value is way above this limit, so the null hypothesis that both coefficients are truly zero can clearly be rejected.
Now suppose we perform a regression calculation with X2 and X3 as the independent variables. We know from the true equation that X3 doesn't belong in the equation, but X1 and X4 do. Unfortunately, the researcher in the field does not see the true equation, and will not always know if important variables have been left out. In this case the resulting regression equation is:
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.894506707
  R Square            0.800142248
  Adjusted R Square   0.720199148
  Standard Error      4.756458501
  Observations        8

ANOVA
              df   SS            MS            F             Significance F
  Regression   2   452.8805126   226.4402563   10.00889686   0.017856753
  Residual     5   113.1194874   22.62389747
  Total        7   566

              Coefficients    Standard Error   t Stat         P-value
  Intercept    11.06633999    5.021584201      2.203754741    0.078720841
  X2           3.683377309    0.846359221      4.35202597     0.007345174
  X3           0.454956653    0.737318586      0.61704216     0.564217878

The r-squared value falls to .8001. The estimated coefficient for X2 is still close to its true value of 4; its t statistic (4.352) is still above the critical value, so we reject the null hypothesis that the true coefficient of X2 is zero. The t statistic for X3 falls inside the interval -2.571 to 2.571, so we accept the null hypothesis that the true coefficient of X3 is zero. This happens to be correct, because we know that X3 is not included in the true equation. However, the regression results provide no way of testing for the fact that there are variables that should be included (X1 and X4) but are missing. In this case the missing variables do not hurt us too badly, but in other cases missing variables can wreak havoc on our ability to estimate the coefficients of the variables that are included.
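For anyone who wants to reproduce these calculations, here is a sketch using Python's statsmodels package (assuming it is installed); its output should match the spreadsheet results above, up to rounding:

    import numpy as np
    import statsmodels.api as sm

    # The sample data from the table above: columns are X1, X2, X3, X4, Y.
    data = np.array([
        [2, 1, 3,  0, 12],
        [3, 3, 8,  0, 23],
        [2, 5, 3,  1, 29],
        [2, 5, 7, -1, 27],
        [5, 1, 3, -1, 20],
        [5, 1, 8,  0, 21],
        [4, 6, 4,  1, 39],
        [5, 5, 8,  0, 37],
    ], dtype=float)
    y = data[:, 4]

    # Regression on X1, X2, X3, X4 (the perfect-fit case).
    X_all = sm.add_constant(data[:, 0:4])
    print(sm.OLS(y, X_all).fit().summary())

    # Regression on X1 and X2 only (r-squared about 0.9943).
    X_two = sm.add_constant(data[:, 0:2])
    print(sm.OLS(y, X_two).fit().summary())

    # The X2, X3 regression follows the same pattern using data[:, 1:3].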