Regression Analysis

Douglas Downing

Seattle Pacific University, Business 3700

How can we tell if two quantities are related to each other? In other words, do they move together, so that one quantity is large when the other is large (and vice versa)? Another way to word it: does knowledge of one of the quantities help you predict the value of the other?

Here are some examples:

Toss two dice. Let X be the number that appears on the first die, and let Y be the number that appears on the second die. Knowledge of X provides you with no help in predicting the value of Y.

Now, let X be the number on top of the die, and let Y be the number on the bottom of the die. (It is traditional to use X and Y for the two variables you are comparing, so you will have to get used to X and Y meaning different quantities in different examples.) It happens that dice are constructed so that the numbers on opposite faces add up to 7, so Y = 7 - X: knowing X allows you to predict Y exactly.

Now toss three dice. Let X be the sum of the first two dice; let Y be the sum of the second and third dice. You would expect there to be some relationship between these two sums, since they both depend on the value that appears on the second die. However, the first and third dice are independent of each other, so knowing X will not allow you to make perfect predictions of Y. That is often the way of the world: there is some relationship, but not a perfect relationship.
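
A small simulation can make this concrete. The sketch below (written in Python; the toss count and variable names are my own, not part of the text) measures how strongly X and Y move together in the three-dice case by computing their correlation coefficient.

import random

# Simulate many tosses of three dice; record X = die1 + die2 and Y = die2 + die3.
random.seed(0)
n = 100_000
xs, ys = [], []
for _ in range(n):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)
    d3 = random.randint(1, 6)
    xs.append(d1 + d2)
    ys.append(d2 + d3)

# Sample correlation between X and Y.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
r = cov / (var_x * var_y) ** 0.5
print(r)   # comes out near 0.5: a real relationship, but far from a perfect one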

    

Regression

    
The branch of statistical analysis involved with fitting lines that represent a pattern of dots is called regression analysis.

In order to write a mathematical equation that describes a line, we need to know two numbers: the slope, and the Y intercept. (The Y intercept is also called the constant term).

If we let a represent the Y intercept and b represent the slope, then the equation can be written:
Y=a+bX

Our mission is to find the values of a and b associated with the line that best fits the data. If there are only two points, then it is always possible to find a line that fits perfectly; if there are more than two points, this is unlikely. No matter how hard you try, there will usually be some deviation between the points and the line you are using to represent them. We want to choose the line that minimizes these deviations (called errors). The normal procedure is to square the errors, add them up, and minimize the result.
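
This minimization has a simple closed-form answer. Here is a minimal sketch (in Python, with made-up illustrative numbers) of the formulas that result from minimizing the sum of squared errors:

# Least-squares slope and intercept for a set of paired observations (illustrative data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2); intercept: a = mean_y - b*mean_x
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x
print(a, b)   # the intercept and slope of the line Y = a + bX with the smallest sum of squared errors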

This sounds like a lot of work, but fortunately we can turn it over to the computer. Here is the general procedure for regression analysis:

  1. Assemble your observations for the two quantities. Each observation of X must be matched with one observation of Y; for example, they could both come from the same state, or the same person, or the same year.
  2. Input the data into a computer program, such as a spreadsheet program or statistics program.
  3. Have the computer create a scatter graph of the data. This step is important because it allows you to visualize whether or not there seems to be a relationship.
  4. Call the command that performs a simple regression analysis on the data. (It is called simple because there is only one independent variable; later we will use multiple regression analysis.)
  5. The computer will report back values for the slope and the intercept, along with some other information that describes the result of the regression analysis (to be discussed shortly). A sketch of steps 2 through 5 in code appears below.
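
This sketch assumes Python with numpy and matplotlib in place of a spreadsheet program, and the data are made up for illustration; it is one possible way to carry out the steps, not the only one.

import numpy as np
import matplotlib.pyplot as plt

# Steps 1-2: paired observations of X and Y entered into the program (illustrative numbers).
x = np.array([3.0, 5.0, 7.0, 8.0, 10.0, 12.0])
y = np.array([8.0, 11.0, 14.0, 18.0, 21.0, 26.0])

# Step 3: scatter graph, to see whether a relationship looks plausible.
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.savefig("scatter.png")

# Steps 4-5: the regression command reports the slope and the intercept.
b, a = np.polyfit(x, y, deg=1)   # a degree-1 fit returns the slope first, then the intercept
print("intercept:", a, "slope:", b)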
    

The r squared value

    
We need more than just a description of the best line: we need a way to measure how good that line is. (The best possible line still might not be very good.) The computer will report a value known as the r-squared value to indicate how well the line fits the data. These are the properties of the r-squared value:
  1. r-squared is always between 0 and 1.
  2. r-squared=1 if all of the observations fit along a straight line, as when we looked for the relationship between the top number and the bottom number on a die.
  3. r-squared=0 if the two quantities are completely independent, as when we looked for a relationship between the numbers on two different dice.
  4. The r-squared value gives the percent of variation in Y that can be accounted for by variations in X; a sketch of this calculation appears below.
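
This sketch (Python, with made-up data) computes r-squared as 1 minus the share of the variation in Y that the fitted line leaves unexplained, reusing the least-squares formulas from the earlier sketch.

# r-squared = 1 - (squared error around the line) / (squared variation around the mean of Y)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.5, 5.0, 7.5, 8.0, 11.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # variation the line fails to explain
sst = sum((yi - mean_y) ** 2 for yi in y)                     # total variation in Y
print(1 - sse / sst)   # between 0 and 1: the share of the variation in Y accounted for by X
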
Here are some examples:

|              .
|            .
|          .
|        .
|      .
|    .
|  .
|.
|____________________

r-squared=1; positive slope to regression line




|.              
| .         
|  .        
|   .     
|    .  
|     .
|      .  
|       .
|____________________
r-squared=1; negative slope to regression line




|              
|          
|  .........        
|  .........     
|  .........  
|  ......... 
|  .........  
|       
|____________________
r-squared=0

Suppose a contest is to be held between Rosencrantz and Guildenstern. Both will be trying to guess the value of a variable Y. Rosencrantz has no information except for the average value of Y. Guildenstern, on the other hand, knows in advance the value of X, and he knows the regression equation connecting Y and X. The question is: will Guildenstern do better than Rosencrantz at guessing Y? If the r-squared value is 0, then knowing X provides no help at forecasting the value of Y. If the r-squared value is one, then knowing X allows you to make a perfect forecast for the value of Y (assuming that the situation has not changed between the time you collected your data and the time you're making your forecast).
    

Examples of simple regression calculations

    
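As one small worked example (a sketch, using Python's scipy; the library choice is mine, not the text's), take the top-and-bottom-of-a-die case from the beginning of this page, where Y = 7 - X. A simple regression on the six possible pairs recovers that line exactly:

from scipy import stats

# X is the number on top of a die, Y the number on the bottom; opposite faces add up to 7.
x = [1, 2, 3, 4, 5, 6]
y = [7 - value for value in x]

result = stats.linregress(x, y)
print(result.slope)        # -1.0: Y falls by one whenever X rises by one
print(result.intercept)    #  7.0: the constant term, so the fitted line is Y = 7 - X
print(result.rvalue ** 2)  #  1.0: a perfect fit, so the r-squared value is 1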
    

Statistical analysis of regression

    
So far we have considered examples where we know the process generating the variables (dice), or where we know data for the complete population (the states). More commonly, we only have data from a sample, and we will use our analysis of the relationship visible in that sample to predict a relationship that would be visible in the population. Sometimes our sample will be limited because of time. We have no way of observing data from the future, so we are limited to using observations from the present and past if we want to discover a relationship between the variables. If the future relationship will stay the same as it was in the past (except for the same level of random variation), then we can use our past observations to predict the future relationship. Other times we may be looking at data from a sample of people (or other type of object) to try to predict properties of the entire population.

Our assumption in regression analysis is that there exists a relationship between two variables X and Y that can be described by this equation:
Y=a+bX+e
where a and b are unknown parameters. X is called the independent variable; Y is the dependent variable. The assumption is that variations in X are responsible for causing the variation in Y. However, you must be careful with this assumption, since the mere fact that you have established a relationship between X and Y does not mean that X causes the changes in Y. It might be that Y causes the changes in X, or it could be that there is a third unidentified factor that causes changes in both X and Y. (For example, any two variables that both grow with time will appear to have a relationship, even if they are totally independent.) Or you might even have bad luck with your sample. Your sample might indicate there is a relationship between X and Y when in fact there is no such relationship in the population. The chance of that happening is small if the sample is large enough, but if you make a career of performing regression analysis, that kind of bad luck is bound to happen occasionally.

e is a random variable called the error term. If X were the only variable that affected Y, then no error term would be needed, and we could find a perfect fit to our regression line. However, there almost always will be other factors affecting Y that we don't observe. If we don't have any information about these, we have to assume that their effect can be described as random chance. The assumption is that e is a random variable with a normal distribution, 0 mean, and unknown variance sigma-squared. The assumption of 0 mean is not restrictive. (If by some chance e had a nonzero mean, then this mean could be added to the constant a, which would redefine e as a new random variable with zero mean.) The assumption that the distribution of e is normal can be tested by looking at the residuals. It is unfortunate that the true value of sigma-squared is unknown, but, as is typical in statistics, we will try to estimate it. If this regression is to be much good in explaining the relationship between X and Y, then sigma-squared needs to be relatively small. Another way of saying it: if the random variable e contributes a large part of the variance of Y, then it means there are other factors influencing Y in addition to X, and you should somehow track those factors down and include them in your analysis. ("What if we find there is more than one quantity that affects Y?" you might be wondering. Look ahead to the section on multiple regression.)

After you have collected the observations for your sample and fed them into the computer, the computer will return the results of the regression calculation: the slope and constant of the regression line. However, these values are not necessarily the same as the true values a and b. We would know those true values only if we could observe the entire population. Instead, we use the regression coefficients from the sample as estimators for the unknown parameters. That means we can perform hypothesis tests on these estimators.
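
A small simulation sketch (Python with numpy; the true values a = 4, b = 2 and the error standard deviation are made up) shows why the estimates differ from the true parameters: each sample carries its own random errors.

import numpy as np

rng = np.random.default_rng(1)

# True parameters of Y = a + bX + e (unknown to the researcher in a real problem).
a_true, b_true, sigma = 4.0, 2.0, 3.0

# One sample of 30 observations: X values plus normally distributed errors with mean 0.
x = rng.uniform(0, 10, size=30)
y = a_true + b_true * x + rng.normal(0, sigma, size=30)

# The regression calculation returns estimators of a and b, not their true values.
b_hat, a_hat = np.polyfit(x, y, deg=1)
print(a_hat, b_hat)   # close to 4 and 2, but not exactly equal, because of the error term

# The residuals can be inspected to check the normality assumption and to estimate sigma.
residuals = y - (a_hat + b_hat * x)
print(residuals.std(ddof=2))   # an estimate of sigma; a large value means other factors matter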

    

Multiple regression

    
Simple regression analysis, as we have discussed so far, includes only one independent variable and one dependent variable. Sometimes, though, one independent variable simply isn't enough. For example, suppose that the price of a house (Y) in a particular city depends on four variables: square feet in house (X1); distance to business center (X2); distance to nearest school (X3); and the interest rate (X4). Assume that the dependent variable Y is related to the four independent variables according to an equation of this form:

Y=B0+B1X1+B2X2+B3X3+B4X4+e

B0 is the constant term, analogous to the Y-intercept term in simple regression. B1, B2, B3, and B4 are called the coefficients. The true values of all of the B's are unknown; we will estimate them based on our regression calculation.
e again represents an error term; assume that it has a normal distribution with mean 0 and unknown variance sigma-squared. It stands for all factors affecting house prices other than the four we have included. The value of sigma-squared should be relatively small; otherwise, there are important factors influencing house prices that we have not included, and our regression equation will not be able to do a very good job of predicting house prices.

There are two important differences between simple regression and multiple regression:

Much of the procedure for multiple regression is the same as for simple regression:
  1. Assemble your observations for all variables. There are five total variables in our example. Each observation will give you one value for each of the five variables. In our example, each observation corresponds to one house sale.
  2. Input the data into a computer program, such as a spreadsheet program or statistics program.
  3. Call the command that performs a multiple regression analysis on the data.
  4. The computer will report back values for the coefficient for each of the independent variables and the constant term. It will also report an r-squared value which again tells what percentage of variation in the dependent variable is explained by the regression equation.
Hypothesis tests can be performed on the individual coefficients to test the relationship between an independent variable and the dependent variable. The equation can also be used to predict future values of Y if you know the future values of X1, X2, X3, and X4.
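
For example, a prediction is just the estimated equation evaluated at the new X values. In the sketch below (Python), the coefficient values are hypothetical placeholders, not estimates from real housing data:

# Hypothetical estimated coefficients for the house-price equation (illustration only).
b0, b1, b2, b3, b4 = 50_000.0, 90.0, -2_000.0, -1_500.0, -4_000.0

def predicted_price(square_feet, miles_to_center, miles_to_school, interest_rate):
    # Point prediction: Y-hat = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 (the error term averages to zero).
    return (b0 + b1 * square_feet + b2 * miles_to_center
            + b3 * miles_to_school + b4 * interest_rate)

print(predicted_price(1800, 5, 1, 6.5))   # predicted price for one hypothetical house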
    

Hypothesis Testing with Multiple Regression

    
The computer will also report a standard error for each coefficient in the multiple regression. A larger value for the standard error means that there is more uncertainty about the true value of that coefficient. Dividing the estimated coefficient by the corresponding standard error gives a quantity known as the t statistic, which is used for hypothesis tests about whether or not a particular variable belongs in the regression equation. If the true value of that coefficient is zero, then the t statistic will come from a t distribution with n-m degrees of freedom (n is the number of observations; m is the number of coefficients that are estimated, including the constant term). If the absolute value of the reported t statistic is greater than the absolute value of the critical value from the t distribution table, then reject the null hypothesis: the true coefficient is not zero, and the variable belongs.

For any reasonable value of the degrees of freedom, the value from the t distribution table will be close to 2. Therefore, the approximate rule is: if the absolute value of the t statistic is greater than 2, reject the null hypothesis and conclude that the variable belongs in the equation; if it is less than 2, do not reject the null hypothesis that the true coefficient is zero.

The computer will also report an F statistic, which is used to test the hypothesis that the coefficients of all of the independent variables are zero. If the null hypothesis is true, and the coefficients are all zero (meaning the regression calculation is worthless for predicting the value of Y), then the F statistic will come from an F distribution with m-1 numerator degrees of freedom and n-m denominator degrees of freedom. If the reported value is greater than the critical value for those degrees of freedom, then reject the null hypothesis.
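
A sketch of both tests (Python with scipy for the table lookups; the coefficient, standard error, F value, and sample sizes below are illustrative numbers, not output from a real regression):

from scipy import stats

n, m = 40, 4                 # 40 observations, 4 estimated coefficients including the constant
coefficient, std_error = 2.5, 0.9

# t test on a single coefficient: reject "the true coefficient is zero" when |t| exceeds the critical value.
t_stat = coefficient / std_error
t_critical = stats.t.ppf(0.975, df=n - m)        # two-tailed test at the 5 percent level
print(t_stat, t_critical, abs(t_stat) > t_critical)

# F test that the coefficients of all of the independent variables are zero.
f_stat = 8.3                                     # as reported by the regression program
f_critical = stats.f.ppf(0.95, dfn=m - 1, dfd=n - m)
print(f_stat, f_critical, f_stat > f_critical)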

In practice the difficult matter with regression analysis is determining which variables to include, and the exact form to use for the equation. You can test to see whether a variable that is included really belongs; however, you might have left out variables that should be included. The restriction that the equation be linear is not a big problem; if the true relationship involves a quadratic curve, such as
Y=aX1+bX1^2
then simply include both X1 and X1-squared as independent variables in your regression analysis. If the true relation is of the form
Y=(X1^B1)(X2^B2)
take the logarithm of each side to convert it to a linear form:
log Y = B1 log X1 + B2 log X2
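
Going back to the quadratic case, here is a sketch (Python with numpy; the data are made up): the point is simply that X1-squared is handed to the regression program as if it were one more independent variable, so the equation stays linear in the coefficients.

import numpy as np

# Illustrative data that roughly follow Y = 2*X1 + 0.5*X1^2, plus small errors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2 * x1 + 0.5 * x1 ** 2 + np.array([0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.3])

# Columns of the design matrix: constant term, X1, and X1 squared.
design = np.column_stack([np.ones_like(x1), x1, x1 ** 2])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefficients)   # the X1 and X1-squared coefficients come out near 2 and 0.5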

Some other problems that can arise with multiple regression include multicollinearity (when two or more of the independent variables are highly correlated); heteroscedasticity (when the variances of the error terms are different for different observations); and serial correlation (when the errors for successive observations of time series data are correlated with each other). These problems make it more difficult for the regression calculation to accurately estimate the coefficients.

Here is some sample data. However, in reality it would not be very reliable to perform a regression calculation when the number of observations is this small.

           X1       X2       X3       X4        Y        
            2        1        3        0       12     
            3        3        8        0       23     
            2        5        3        1       29     
            2        5        7       -1       27     
            5        1        3       -1       20     
            5        1        8        0       21     
            4        6        4        1       39     
            5        5        8        0       37          
The value of Y is given by the equation:
Y=2+3X1+4X2+X4
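
The same calculation can be reproduced outside a spreadsheet. Here is a sketch (Python with numpy) that runs the regression of Y on all four independent variables using the eight observations above:

import numpy as np

# The eight observations from the table above; columns are X1, X2, X3, X4, Y.
data = np.array([
    [2, 1, 3,  0, 12],
    [3, 3, 8,  0, 23],
    [2, 5, 3,  1, 29],
    [2, 5, 7, -1, 27],
    [5, 1, 3, -1, 20],
    [5, 1, 8,  0, 21],
    [4, 6, 4,  1, 39],
    [5, 5, 8,  0, 37],
], dtype=float)

x = data[:, :4]
y = data[:, 4]

# Design matrix: a column of ones for the constant term, followed by X1 through X4.
design = np.column_stack([np.ones(len(y)), x])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefficients)   # approximately [2, 3, 4, 0, 1]: the constant, then the coefficients of X1-X4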

However, remember that in reality we will not be able to see the true equation as we can in this artificial example. If we knew that X1, X2, and X4 all should be included, then we would run the regression with those independent variables and we would find an r-squared value of 1, with each of the true coefficients found exactly. However, in reality we don't know for sure which variables should be included. Suppose we perform a multiple regression calculation with X1, X2, X3, and X4 as the independent variables. The results are:

SUMMARY OUTPUT								
								
Regression Statistics								
Multiple R		1							
R Square		1							
Adjusted R Square	1							
Standard Error		1.34894E-14							
Observations		8							
								
ANOVA								
		df	SS		MS		F		Significance F			
Regression	4	566		141.5		7.77627E+29	2.36796E-45			
Residual	3	5.45892E-28	1.81964E-28					
Total		7	566						
								
	Coefficients		Standard Error	t Stat		P-value		
Intercept	2		1.86004E-14	1.07525E+14	1.77398E-42	
X1		3		3.89823E-15	7.6958E+14	4.83848E-45
X2		4		2.82546E-15	1.4157E+15	7.77249E-46	
X3		-2.75345E-16	2.31871E-15	-0.118749562	0.912978976	
X4		1		7.91699E-15	1.26311E+14	1.09434E-42	

There is a perfect fit. The t statistics for all variables except X3 are huge, indicating they all belong. Also, the regression correctly indicates that X3 does not belong, because its t statistic (-0.1187) is between -2 and 2. Suppose we perform a multiple regression calculation with X1 and X2 as the independent variables. The results are:
SUMMARY OUTPUT								
								
Regression Statistics								
Multiple R		0.997160976							
R Square		0.994330011							
Adjusted R Square	0.992062016							
Standard Error		0.801150874							
Observations		8							
								
ANOVA								
		df	SS		MS		F		Significance F			
Regression	2	562.7907864	281.3953932	438.4179848	2.42078E-06			
Residual	5	3.209213615	0.641842723					
Total		7	566						
								
		Coefficients	Standard Error	t Stat		P-value	
Intercept	1.558098592	1.033917083	1.506986021	0.192171443
X1		2.977992958	0.219146535	13.5890488	3.86799E-05
X2		4.153755869	0.145235749	28.60009257	9.79027E-07

The r-squared value is 0.9943. We do not have a perfect fit, because we left out the variable X4. However, X4 has only a very small influence on Y, so leaving it out has not hurt our regression equation noticeably. The estimated coefficients (2.978 and 4.1538) are close to the true values (3 and 4, respectively). The t statistics need to be compared against a t distribution with 8-3=5 degrees of freedom, which gives a critical value of 2.571 using the 5 percent significance level. The two t statistics (13.589 and 28.6) are way above the critical value, so we can clearly reject the hypothesis that the true coefficients are zero.

The F statistic for this regression is reported to be 438.4; this needs to be compared against an F distribution with 3-1=2 numerator degrees of freedom and 8-3=5 denominator degrees of freedom. The critical value is 5.79. The observed value is way above this limit, so the null hypothesis that both coefficients are truly zero can clearly be rejected.

Now suppose we perform a regression calculation with X2 and X3 as the independent variables. We know from the true equation that X3 doesn't belong in the equation, but X1 and X4 do. Unfortunately, the researcher in the field does not see the true equation, and will not always know if important variables have been left out. In this case the resulting regression equation is:


SUMMARY OUTPUT								
								
Regression Statistics								
Multiple R		0.894506707							
R Square		0.800142248							
Adjusted R Square	0.720199148							
Standard Error		4.756458501							
Observations		8							
								
ANOVA								
		df	SS		MS		F		Significance F			
Regression	2	452.8805126	226.4402563	10.00889686	0.017856753			
Residual	5	113.1194874	22.62389747					
Total		7	566						
								
		Coefficients	Standard Error	t Stat		P-value	
Intercept	11.06633999	5.021584201	2.203754741	0.078720841	
X2		3.683377309	0.846359221	4.35202597	0.007345174
X3		0.454956653	0.737318586	0.61704216	0.564217878
								
The r-squared value falls to .8001. The estimated coefficient for X2 is still close to its true value of 4; its t statistic (4.352) is still above the critical value, so we reject the null hypothesis that the true coefficient of X2 is zero. The t statistic for X3 falls inside the interval -2.571 to 2.571, so we accept the null hypothesis that the true coefficient of X3 is zero. This happens to be correct, because we know that X3 is not included in the true equation. However, the regression results provide no way of detecting that variables that should be included (X1 and X4) are missing. In this case the missing variables do not hurt us too badly, but in other cases missing variables can wreak havoc on our ability to estimate the coefficients of the variables that are included.