This file is part of a program based on
the Bio 4835 Biostatistics class taught at
Daniel, W. W. 1999. Biostatistics: a foundation for analysis in the health sciences.
The file follows this text very closely and readers are encouraged to consult the text for further information.
REGRESSION AND CORRELATION
Regression and correlation analysis procedures are used to study the relationships between variables. Regression is used to predict the value of one variable based on the value of a different variable. Correlation is a measure of the strength of a relationship between variables. The variables are data which are measured and/or counted in an experiment. In the case of the examples used here, the data were obtained by counting the breathing rate of goldfish in a laboratory experiment.
Nature of data
The data for regression and correlation consist of pairs in the form (x,y). The independent variable (x) is determined by the experimenter. This means that the experimenter has control over the variable during the experiment. In our experiment, the temperature was controlled. The dependent variable (y) is the effect that is observed during the experiment. It is assumed that the values obtained for the dependent variable result from the changes in the independent variable. Regression and correlation analyses will determine the nature of this relationship, if any, and its strength. All of the (x,y) pairs can be considered to form a population. In some experiments, numerous observations of y are taken at each value of x. In these cases, each set of values of y taken at a particular value of x forms a subpopulation of the data.
Data are represented using a plot called a scatter plot or scatter diagram or x-y plot. During analysis we try to find the equation of a line that fits the data. This is called the regression line. From algebra, we recall that points which are (x,y) pairs can be plotted on the Cartesian coordinate system. We also recall that a straight line on the Cartesian coordinate system has the equation y = mx + b, where m is the slope of the line, and b is the y-intercept of the line. The slope is always the coefficient of the x term in the equation. Following this pattern, the slope of the regression line can be given using various forms of equations, for example, y = ax + b, y = a + bx, etc. By looking at the equation we can determine the slope and the y-intercept.
Scatter diagrams can show a direct relationship between x and y. Consider the equation y = a + bx, where b is the slope. A direct relationship exists when the slope of the line (b) is positive. An inverse relationship exists when the slope of the line is negative. When the slope b = 0, there is no linear relationship. The nature of the relationship is discussed as part of correlation.
The regression line
As noted above, a straight line plotted on the Cartesian coordinate system can have the equation y = mx + b. We remember that m is the slope and b is the y-intercept.
A regression line will have a general form
y = a + bx + e
a is the y-intercept
b is the slope of the line
e is an error term
In practice, under ordinary circumstances, we do not know the value of the error term so we use the following form of the equation
y = a + bx
although alternative forms (such as y = ax + b) will also yield the same results.
From the study of correlation we learn that when the slope of the regression line is positive (meaning that the value of b is positive) the value of y increases as the value of x increases. This is called a positive correlation. When the slope of the regression line is negative (meaning that the value of b is negative) the value of y decreases as x increases. The strength of these relationships is given by the correlation coefficient (r) which can be calculated.
Regression and correlation work depends on a set of calculations. These are done by taking the distance that a point is from the theoretical regression line and squaring it. By adding these squares you obtain the sum of squares. Sum of squares information can be determined by calculating basic statistics on the data of the dependent and independent variables. Refer to the section on calculations for regression and correlation.
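The basic quantities behind these calculations can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual sums of squared deviations about the means; the names Sxx, Syy and Sxy and the data values are assumptions made for the example and are not taken from the text or from the goldfish experiment.

```python
def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy): sums of squared deviations and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy

# Made-up values for illustration only; NOT the goldfish data.
temps = [10, 15, 20, 25]
rates = [40, 70, 85, 115]
sxx, syy, sxy = sums_of_squares(temps, rates)
print(sxx, syy, sxy)
```

The slope, the intercept and the correlation coefficient can all be built from these three numbers, which is why a calculator only needs basic statistics on the two lists.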
Regression analysis is used to predict the value of one variable based on the value of a second variable which is controlled by the experimenter. Results may be plotted on a scatter plot as noted earlier.
Data for regression analysis are in the form of (x,y) pairs which can be listed in two columns to form a data table. In the following example, opercular breathing rates (in counts per minute) were measured in the biology laboratory. Counts were made at various temperatures ranging from 9°C to 27°C. The data are presented in the figure below.
A linear equation in the form of y = mx + b can be calculated for these data. Sometimes the equation is given as y = ax + b and other times it is given as y = a + bx. No matter which form is used, we are interested in the coefficient accompanying the variable (x).
(A) List of opercular breathing rates of goldfish vs temperature and (B) scatter plot of the data.
The coefficient accompanying the variable is sometimes called the regression coefficient. The other term, the constant, is the y-intercept where the regression line crosses the y-axis.
For the sample data shown in A, above, the linear regression equation can be calculated as shown in the figure below.
Linear regression calculation for goldfish breathing data.
Based on this information, we learn that the equation for the regression line of these data is
y = 4.54x - 1.57
The calculator also presents us with two additional pieces of information, r and r², which are described under correlation.
Using the regression equation
The regression equation can be used to predict the breathing rate of goldfish within certain reasonable limits. For example, if the temperature were 19.5°C (= x) we could calculate the breathing rate (= y).
y = 4.54x - 1.57
= 4.54 (19.5) - 1.57
= 86.96
Similarly, if x were 11, y could be calculated.
y = 4.54x - 1.57
= 4.54 (11) - 1.57
= 48.37
In each case, the desired value was within the range of the x values. Finding intermediate values this way is called interpolation. Finding values outside the range of the x values is called extrapolation. For example, we could calculate the breathing rate at 28°C, which is just above the highest temperature measured.
y = 4.54x - 1.57
= 4.54 (28) - 1.57
= 125.55
There are limitations on using regression equations for prediction. In this example, one can put a temperature like 42 or 100 into the equation and get an answer, but one must remember that many enzymes stop functioning at 42°C and at a temperature of 100°C, water boils and the fish would be cooked.
Regression analysis theory indicates that interpolation is most reliable near the middle of the range of the x values and less reliable toward the ends of the range. One should be cautious with extrapolation because the results quickly become more and more unreliable the further one goes beyond the range of the x values.
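The predictions in this section can be reproduced with a short Python sketch. The slope, intercept and temperature range (9°C to 27°C) come from the text; the function name and the interpolation/extrapolation label it returns are assumptions made for the example.

```python
# Slope and intercept of the regression line y = 4.54x - 1.57.
SLOPE = 4.54
INTERCEPT = -1.57
# Range of temperatures at which counts were made (degrees C).
X_MIN, X_MAX = 9.0, 27.0

def predict_breathing_rate(temp_c):
    """Return (predicted rate, kind), kind being 'interpolation' or 'extrapolation'."""
    rate = SLOPE * temp_c + INTERCEPT
    kind = "interpolation" if X_MIN <= temp_c <= X_MAX else "extrapolation"
    return rate, kind

for temp in (19.5, 11, 28):
    rate, kind = predict_breathing_rate(temp)
    print(f"{temp} C -> {rate:.2f} counts per minute ({kind})")
```

Labeling each prediction this way is one simple guard against treating an extrapolated value with the same confidence as an interpolated one.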
Significance of regression analysis
It is possible to perform a linear regression t test. Testing the relationship involves a null hypothesis that there is no relationship. In stating the hypotheses, β is the population regression coefficient and ρ is the population correlation coefficient; the null hypothesis is that both are equal to 0.
Calculation of the linear regression t test gives a t value of 9.62, with probability p = 2.06 × 10⁻⁴. This permits us to say that the result is significant: a t value this large would be very unlikely to occur by chance if there were no relationship. See the figure below.
Linear regression t test: (A) calculation setup and (B) results.
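The t value can be checked by hand from the coefficient of determination. The sketch below uses the standard relationship between t, r² and the number of (x,y) pairs, with n - 2 degrees of freedom. The value of r² is the one reported later in this document; the sample size n = 7 is an assumption (the data table is not reproduced here), chosen because it is consistent with the reported t value.

```python
import math

def regression_t(r_squared, n):
    """Magnitude of the t statistic for the linear regression t test (n - 2 df).

    The sign of t follows the sign of the slope; only the magnitude is returned.
    """
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))

# r squared as reported in this document; n = 7 is an ASSUMPTION.
t_value = regression_t(0.948691071, 7)
print(round(t_value, 2))
```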
Correlation is used to give information about the relationship between x and y. When the regression equation is calculated, the correlation results indicate the nature and the strength of the relationship.
The correlation coefficient, r, indicates the nature and strength of the relationship between x and y. Values of r range from -1 to +1. A correlation coefficient of 0 means that there is no linear relationship. A value of -1 is a perfect negative correlation and a value of +1 indicates a perfect positive correlation.
Examples of correlational relationship: (A) perfect negative correlation, r = -1; (B) no correlation, r = 0; (C) perfect positive correlation, r = +1.
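The three cases can be reproduced numerically. The sketch below computes the correlation coefficient from sums of deviations about the means, which is one standard way of writing it; the function name and the small data sets are assumptions constructed for the example so that the points fall exactly on a rising line, exactly on a falling line, or in a symmetric pattern with no linear trend.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # points exactly on a rising line
r_negative = pearson_r(xs, [10 - 3 * x for x in xs])  # points exactly on a falling line
r_none = pearson_r(xs, [2, 1, 0, 1, 2])               # symmetric V shape, no linear trend
print(r_positive, r_negative, r_none)
```

The third case is a reminder that r = 0 rules out a linear relationship only: the y values there clearly depend on x, but not along a straight line.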
Coefficient of determination
Another value of use in correlation analysis is the coefficient of determination, which is represented as r². Because it is a square, it is never negative and varies between 0 and 1.
The coefficient of determination indicates how much of the variation in y is accounted for by the factor being studied in the regression analysis (x). In the case of goldfish data, the regression analysis results in an equation of
y = 4.54x - 1.57
We see that as the temperature increases, the breathing rate increases. For these data,
r = .9740077366
r² = .948691071
The value of the correlation coefficient, r, is 0.97. This indicates a very strong positive correlation between temperature and breathing rate. The coefficient of determination, r², has a value of 0.949. This indicates that about 95% of the variation in breathing rate is accounted for by temperature, which is the factor being considered in this activity.
Calculations for regression and correlation
The general form for the regression line is
y = a + bx + e
This equation provides a model that can be used to predict relationships between x and y. When the regression equation is calculated in the form y = a + bx, the method used to obtain the result is known as the method of least squares. Return to Figure 35, page 148 for the data and scatter plot of the goldfish opercular breathing rates.
A close examination of the figure below shows that there is a very close relationship between the points and their regression line. Note, however, that in no case does the line pass directly through any point.
Relationship of breathing rate to temperature in goldfish: (A) scatter plot and (B) scatter plot with regression line.
Each point lies a little distance above or below the line. This distance is called a residual and is the difference between the actual y value and its value according to the regression line. The least-squares formulas give the equation of the line for which the sum of the squared residuals is as small as possible. The discussion below will show where the values that the calculator gives you come from.
Residuals are very useful in giving information about the data. If there is a relationship trend underlying the data, it will show up in the residuals. Data that give a good regression line have residuals that are randomly scattered on a residual plot. If there is a trend on the residual plot then something else is going on which needs attention in the data.
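A residual calculation can be sketched as follows. The line used is the regression equation from this document (a = -1.57, b = 4.54); the observations are made-up values for illustration and are not the goldfish data, and the function name is an assumption.

```python
def residuals(xs, ys, a, b):
    """Observed y minus the value predicted by the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up observations for illustration only; NOT the goldfish data.
xs = [10, 15, 20, 25]
ys = [45, 65, 90, 110]
res = residuals(xs, ys, a=-1.57, b=4.54)
for x, r in zip(xs, res):
    print(f"x = {x}: residual = {r:+.2f}")
```

Plotting these residuals against x gives the residual plot described above. Signs that alternate with no pattern suggest random scatter, while a run of residuals with the same sign or a steady drift would suggest a trend that the straight line misses.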
In order to do a regression and correlation calculation, certain items of information are necessary. These are listed in Table XIII along with the settings used on the TI-83 calculator. It is assumed that the x values are stored in list L1 and the y values are stored in list L2.
Data used in regression and correlation calculation.
These items of information will be used to calculate the equation of the regression line, the correlation coefficient (which is also used to obtain the coefficient of determination) and the variance of the residuals.
Calculate the regression equation
The regression equation is in the form y = a + bx. It is found using the values in the table above. The value of b is calculated first, then the value of a is obtained using the value obtained for b.
Calculation of b.
Calculation of a.
These two results give us the equation of the regression line in figure B which is
y = 4.54x - 1.57
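The order of calculation described above (b first, then a) can be sketched in Python. This is a minimal illustration of the method of least squares using the common computational form based on n, Σx, Σy, Σxy and Σx², which may be laid out differently from the formulas in the text. The function name and the data are assumptions; the values are made up and are not the goldfish data.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first ...
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # ... then the intercept, using the slope: a = mean(y) - b * mean(x).
    a = sum_y / n - b * sum_x / n
    return a, b

# Made-up values for illustration only; NOT the goldfish data.
temps = [10, 15, 20, 25]
rates = [40, 70, 85, 115]
a, b = least_squares(temps, rates)
print(f"y = {a:.2f} + {b:.2f}x")
```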
Calculate the correlation coefficient
The correlation coefficient is given by the formula below, which is used in its calculation.
The value of r obtained from this calculation is the correlation coefficient. The square of the correlation coefficient is the coefficient of determination.
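For reference, one standard computational form of the correlation coefficient, written with the same sums used for the regression equation (n, Σx, Σy, Σxy, Σx², Σy²), is sketched below. It is offered as an assumption about the form intended here; other algebraically equivalent layouts exist.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}},
\qquad
r^{2} = \left(r\right)^{2}
```

The numerator is the same as the numerator of the slope b, which is why r always has the same sign as the slope of the regression line.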
Calculation of residual variance
For the calculation of the residual variance, some additional formulas are used. These are listed in the following table.
Information used in calculating residual variance.
The formula for residual variance is given below as is the calculation for the sample.
So the residual variance is 56.1365714. By taking the square root of this variance, a value of 7.492434279 is obtained. This is the residual standard deviation that is found using the linear regression t test on the TI-83. The calculator will give all of the results obtained during the discussion above. The figure below gives the two screens resulting from the linear regression t test.
Results of linear regression t test: first screen and second screen.
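The last step can be checked with a short Python sketch. The function assumes the usual definition of residual variance for simple linear regression, the sum of squared residuals divided by n - 2; the function name and the small data set are assumptions made for the example (the values are made up, not the goldfish data). The final lines take the square root of the residual variance quoted above.

```python
import math

def residual_variance(xs, ys, a, b):
    """Sum of squared residuals about the line y = a + b*x, divided by n - 2."""
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return ss_res / (len(xs) - 2)

# Made-up values and their least-squares line, for illustration only.
temps = [10, 15, 20, 25]
rates = [40, 70, 85, 115]
example_variance = residual_variance(temps, rates, a=-6.5, b=4.8)

# Residual standard deviation from the variance reported in this document.
residual_sd = math.sqrt(56.1365714)
print(example_variance, residual_sd)
```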