*This file is part of a program based on
the Bio 4835 Biostatistics class taught at Kean
University in *

REGRESSION AND CORRELATION

**Introduction**

Regression and correlation analysis procedures are used to study the relationships between variables. Regression is used to predict the value of one variable based on the value of a different variable. Correlation is a measure of the strength of a relationship between variables. The variables are data which are measured and/or counted in an experiment. In the case of the examples used here, the data were obtained by counting the breathing rate of goldfish in a laboratory experiment.

**Nature of data**

The data for regression and correlation consist of pairs in the form (x,y). The independent variable (x) is determined by the experimenter. This means that the experimenter has control over the variable during the experiment. In our experiment, the temperature was controlled. The dependent variable (y) is the effect that is observed during the experiment. It is assumed that the values obtained for the dependent variable result from the changes in the independent variable. Regression and correlation analyses will determine the nature of this relationship, if any, and the strength of the relationship. All of the (x,y) pairs can be considered to form a population. In some experiments, numerous observations of y are taken at each value of x. In these cases, the set of y values taken at a particular value of x forms a subpopulation of the data.

**Graphical representation**

Data are represented using a plot called a scatter plot, scatter diagram, or x-y plot. During analysis we try to find the equation of a line that fits the data. This is called the regression line. From algebra, we recall that points which are (x,y) pairs can be plotted on the Cartesian coordinate system. We also recall that a straight line on the Cartesian coordinate system has the equation **y = mx + b**, where **m** is the slope of the line, and **b** is the y-intercept of the line. The slope is always the coefficient of the x term in the equation. Following this pattern, the slope of the regression line can be given using various forms of equations, for example, y = ax + b, y = a + bx, etc. By looking at the equation we can determine the slope and the y-intercept.

Scatter diagrams can show a direct relationship between x and y. Consider the equation y = a + bx, where b is the slope. A direct relationship exists when the slope of the line (b) is positive. An inverse relationship exists when the slope of the line is negative. When the slope b = 0, there is no linear relationship. The nature of the relationship is discussed as part of correlation.

**The regression line**

As noted above, a straight line plotted on the Cartesian coordinate system can have the equation y = mx + b. We remember that m is the slope and b is the y-intercept.

A regression line will have a general form

y = a + bx + e

where:

a is the y-intercept

b is the slope of the line

e is an error term

In practice, under ordinary circumstances, we do not know the value of the error term so we use the following form of the equation

y = a + bx

although alternative forms (such as y = ax + b) will also yield the same results.

From the study of correlation we learn that when the slope of the regression line is positive (meaning that the value of b is positive) the value of y increases as the value of x increases. This is called a positive correlation. When the slope of the regression line is negative (meaning that the value of b is negative) the value of y decreases as x increases. This is called a negative correlation. The strength of these relationships is given by the correlation coefficient (r), which can be calculated.

**Calculations**

Regression and correlation work depends on a set of calculations. These are done by taking the distance that a point is from the theoretical regression line and squaring it. By adding these squares you obtain the sum of squares. Sum of squares information can be determined by calculating basic statistics on the data of the dependent and independent variables. Refer to the section on calculations for regression and correlation.

**Regression analysis**

Regression analysis is used to predict the value of one variable based on the value of a second variable which is controlled by the experimenter. Results may be plotted on a scatter plot as noted earlier.

**Data**

Data for regression analysis are in the form of (x,y) pairs which can be listed in two columns to form a data table. In the following example, opercular breathing rates (in counts per minute) were measured in the biology laboratory. Counts were made at various temperatures ranging from 9°C to 27°C. The data are presented in the figure below.

**Regression equation**

A linear equation in the form of y = mx + b can be calculated for these data. Sometimes the equation is given as y = ax + b and other times it is given as y = a + bx. No matter which form is used, we are interested in the coefficient accompanying the variable (x).

*Lists of (A) opercular breathing rate of goldfish vs temperature and (B) scatter plot of the data.*

The coefficient accompanying the variable is sometimes called the regression coefficient. The other term, the constant, is the y-intercept where the regression line crosses the y-axis.

For the sample data shown in A, above, the linear regression equation can be calculated as shown in the figure below.

*Linear regression calculation for goldfish breathing data.*

Based on this information, we learn that the equation for the regression line of these data is

y = 4.54x - 1.57

The calculator also presents us with two additional pieces of information, r and r^{2}, which are described under correlation.

**Using the regression equation**

The regression equation can be used to predict the breathing rate of goldfish within certain reasonable limits. For example, if the temperature were 19.5°C (= x) we could calculate the breathing rate (= y).

y = 4.54x - 1.57

= 4.54 (19.5) - 1.57

= 86.96

Similarly, if x were 11, y could be calculated.

y = 4.54x - 1.57

= 4.54 (11) - 1.57

= 48.37

In each case, the desired value was within the range of the x values. Finding intermediate values this way is called **interpolation**. Finding values outside the range of the x values is called **extrapolation**. For example, we could calculate the breathing rate at 28°C, which lies just outside the measured range of 9°C to 27°C.

y = 4.54x - 1.57

= 4.54 (28) - 1.57

= 125.55

There are limitations on using regression equations for prediction. In this example, one can put a temperature like 42 or 100 into the equation and get an answer, but one must remember that many enzymes stop functioning at 42°C and at a temperature of 100°C, water boils and the fish would be cooked.

Regression analysis theory indicates that interpolation is most reliable near the middle of the range of the x values. It is less reliable at the ends of the range. One should be cautious with extrapolation because the results become more and more unreliable very quickly as one goes further away from the range of the x values.
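The predictions above can be reproduced with a short Python sketch. It uses the slope, intercept, and temperature range quoted in this section; the function name and the interpolation/extrapolation label are illustrative choices and are not part of the TI-83 procedure.

```python
# Coefficients and temperature range are taken from the text above.
SLOPE = 4.54               # change in breathing rate per degree C
INTERCEPT = -1.57          # y-intercept of the regression line
X_MIN, X_MAX = 9.0, 27.0   # range of temperatures that were measured


def predict_breathing_rate(temp_c):
    """Return (predicted rate, kind) for a temperature in degrees C.

    kind is 'interpolation' inside the measured range and
    'extrapolation' outside it.
    """
    rate = SLOPE * temp_c + INTERCEPT
    kind = "interpolation" if X_MIN <= temp_c <= X_MAX else "extrapolation"
    return rate, kind


for temp in (19.5, 11, 28):
    rate, kind = predict_breathing_rate(temp)
    print(f"{temp} C -> {rate:.2f} counts per minute ({kind})")
```

The label is only a reminder. It does not say how far a prediction can be trusted, and a biologically impossible temperature such as 100°C would still return a number.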

**Significance of regression analysis**

It is possible to perform a linear regression *t* test. Testing the relationship involves a null hypothesis that there is no relationship. In stating the hypotheses, β is the population regression coefficient and ρ is the population correlation coefficient.

**Hypotheses**

**Results**

Calculation of the linear regression *t* test gives a *t* value of 9.62, with probability p = 2.06 × 10^{-4}. This permits us to say that the result is significant. There is a very low probability that it occurred by chance. See the figure below.

*Linear regression t test: (A) calculation setup, (B) results.*
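As a rough check on these results, the *t* statistic can be computed from the coefficient of determination, using the standard relation t = sqrt(r^{2}(n - 2) / (1 - r^{2})) with n - 2 degrees of freedom. The sketch below makes two assumptions that the text does not state: the number of (x,y) pairs is taken to be n = 7, because that value gives a *t* close to the reported 9.62, and the test is taken to be two-sided. The p-value is obtained by numerical integration of the *t* density, which is a stand-in for whatever the calculator does internally.

```python
import math


def t_statistic(r_squared, n):
    """Magnitude of t for testing zero slope: sqrt(r^2 (n - 2) / (1 - r^2))."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value using Simpson's rule on the density from 0 to |t|.

    steps must be even.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    return 1.0 - 2.0 * area


R_SQUARED = 0.948691071   # r^2 reported for the goldfish data
N_PAIRS = 7               # ASSUMPTION: not stated in the text

t = t_statistic(R_SQUARED, N_PAIRS)
p = two_sided_p(t, N_PAIRS - 2)
print(f"t = {t:.3f}, two-sided p = {p:.3e}")
```

Under these assumptions the script gives values close to the reported *t* and p. With a different number of pairs, both would change.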

**Correlation**

Correlation is used to give information about the relationship between x and y. When the regression equation is calculated, the correlation results indicate the nature and the strength of the relationship.

**Correlation coefficient**

The correlation coefficient, r, indicates the nature and strength of the relationship between x and y. Values of r range from -1 to +1. A correlation coefficient of 0 means that there is no linear relationship. A value of -1 indicates a perfect negative correlation and a value of +1 indicates a perfect positive correlation.

*Examples of correlational relationship: (A) perfect negative correlation, r = -1; (B) no correlation, r = 0; (C) perfect positive correlation, r = +1.*
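The three cases can be illustrated numerically. The sketch below computes r from its definition for three small made-up data sets (they are not the goldfish data); the function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r for paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]                    # made-up x values

falling = [10, 8, 6, 4, 2]             # y drops by 2 for every step in x
u_shaped = [4, 1, 0, 1, 4]             # y = (x - 3)^2, no straight-line trend
rising = [2, 4, 6, 8, 10]              # y rises by 2 for every step in x

print("perfect negative:", pearson_r(x, falling))
print("no correlation:  ", pearson_r(x, u_shaped))
print("perfect positive:", pearson_r(x, rising))
```

The middle data set shows that r = 0 rules out only a straight-line relationship. Here y depends entirely on x, but along a curve.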

**Coefficient of determination**

Another value of use in correlation analysis is the coefficient of determination, which is represented as r^{2}. Because it is a square, it is never negative and varies between 0 and 1.

The coefficient of determination indicates how much of the variation in y is accounted for by the factor being studied (x) in the regression analysis. In the case of the goldfish data, the regression analysis results in an equation of

y = 4.54x - 1.57

We see that as the temperature increases, the breathing rate increases. For these data,

r = .9740077366

r^{2} = .948691071

The value of the correlation coefficient, r, is 0.97. This indicates a very strong positive correlation between temperature and breathing rate. The coefficient of determination, r^{2}, has a value of 0.949. This indicates that about 95% of the variation in breathing rate is accounted for by temperature, which is the factor being considered in this activity.
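The link between r, r^{2}, and the percentage quoted above can be sketched in a few lines of Python. The values of r^{2} and the slope come from this section; the variable names are illustrative. The sketch relies on the fact that r always has the same sign as the slope of the regression line.

```python
import math

r_squared = 0.948691071   # coefficient of determination reported in the text
slope = 4.54              # slope of the regression line reported in the text

# r has the same sign as the slope, so it can be recovered from r^2.
r = math.copysign(math.sqrt(r_squared), slope)

explained = 100 * r_squared           # percent of variation in y accounted for by x
unexplained = 100 * (1 - r_squared)   # percent left to other factors and chance

print(f"r = {r:.3f}")
print(f"explained: {explained:.1f}%  unexplained: {unexplained:.1f}%")
```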

**Calculations for regression and correlation**

The general form for the regression line is

y = a + bx + e

This equation provides a model that can be used to predict relationships between x and y. When the regression equation is calculated in the form y = a + bx, the method used to obtain the result is known as the **method of least squares**. Return to Figure 35, page 148 for the data and scatter plot of the goldfish opercular breathing rates.

A close examination of the figure below shows that there is a very close relationship between the points and their regression line. Note, however, that in no case does the line pass directly through any point.

*Relationship of breathing rate to temperature in goldfish: (A) scatter plot, (B) scatter plot with regression line.*

Each point lies a little distance above or below the line. This distance is called a **residual** and is the difference between the actual y value and its value according to the regression line. The least squares formulas give the equation of the line for which the sum of the squared residuals is at its minimum. The discussion below will show where the values that the calculator gives you come from.

Residuals are very useful in giving information about the data. If there is a relationship trend underlying the data, it will show up in the residuals. Data that give a good regression line have residuals that are randomly scattered on a residual plot. If there is a trend on the residual plot then something else is going on which needs attention in the data.
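A minimal Python sketch of how residuals are obtained is given here. The five data points are made up for illustration and are not the goldfish measurements; the helper names `fit_line` and `residuals` are likewise illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for the line y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residuals(xs, ys, a, b):
    """Observed y minus the y value on the line, for each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3, 4, 5]                # made-up data, not the goldfish values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)

for x, e in zip(xs, res):
    print(f"x = {x}: residual = {e:+.2f}")
print(f"sum of residuals = {sum(res):.6f}")
```

For a least squares line with an intercept, the residuals always sum to zero apart from rounding. Their pattern is what matters: in this example the signs alternate, which is the kind of random scatter described above. A run of positive residuals followed by a run of negative ones would suggest a curve.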

In order to do a regression and correlation calculation, certain items of information are necessary. These are listed in Table XIII along with the settings used on the TI-83 calculator. It is assumed that the x values are stored in list L1 and the y values are stored in list L2.

**Data used in regression and correlation calculation.**

These items of information will be used to calculate the equation of the regression line, the correlation coefficient (which is also used to obtain the coefficient of determination) and the variance of the residuals.

**Calculate the regression equation**

The regression equation is in the form y = a + bx. It is found using the values in the table above. The value of b is calculated first, then the value of a is obtained using the value obtained for b.

*Calculation of b.*

*Calculation of a*.

These two results give us the equation of the regression line in figure B which is

y = 4.54x - 1.57
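The order of calculation (b first, then a) can be sketched in Python using one common textbook form of the least squares formulas, which works from summary totals: b = (Σxy - ΣxΣy/n) / (Σx² - (Σx)²/n) and a = ȳ - b·x̄. It is an assumption that this is the form used in the original calculation, and the totals below come from a made-up five-point data set, not from the goldfish table.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for y = a + bx from summary totals of the data."""
    sxy = sum_xy - sum_x * sum_y / n    # corrected sum of cross products
    sxx = sum_x2 - sum_x ** 2 / n       # corrected sum of squares of x
    b = sxy / sxx                       # slope is calculated first
    a = sum_y / n - b * (sum_x / n)     # then a = mean of y - b * mean of x
    return a, b


# Totals for the made-up pairs (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1).
a, b = regression_from_sums(n=5, sum_x=15.0, sum_y=30.1, sum_x2=55.0, sum_xy=110.2)
print(f"y = {a:.2f} + {b:.2f}x")
```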

**Calculate the correlation coefficient**

The correlation coefficient is calculated using the formula below.

The value of r obtained from this calculation is the correlation coefficient. The square of the correlation coefficient is the coefficient of determination.
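A common computational form of the correlation coefficient, which works from summary totals, is r = (Σxy - ΣxΣy/n) / sqrt[(Σx² - (Σx)²/n)(Σy² - (Σy)²/n)]. The Python sketch below applies it to totals from a made-up five-point data set, not the goldfish data. Whether the original formula was written in exactly this form is an assumption.

```python
import math


def correlation_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Correlation coefficient r from summary totals of the data."""
    sxy = sum_xy - sum_x * sum_y / n    # corrected sum of cross products
    sxx = sum_x2 - sum_x ** 2 / n       # corrected sum of squares of x
    syy = sum_y2 - sum_y ** 2 / n       # corrected sum of squares of y
    return sxy / math.sqrt(sxx * syy)


# Totals for the made-up pairs (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1).
r = correlation_from_sums(n=5, sum_x=15.0, sum_y=30.1,
                          sum_x2=55.0, sum_y2=220.91, sum_xy=110.2)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```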

**Calculation of residual variance**

For the calculation of the residual variance, some additional formulas are used. These are listed in the following table.

**Information used in calculating residual variance.**

The formula for residual variance is given below as is the calculation for the sample.

So the residual variance is 56.1365714. By taking the square root of this variance, a value of 7.492434279 is obtained. This is the residual standard deviation that is found using the linear regression *t* test on the TI-83. The calculator will give all of the results obtained during the discussion above. The figure below gives the two screens resulting from the linear regression *t* test.

*Results of linear regression t test: first screen and second screen.*
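The residual variance calculation can be sketched in Python as follows. The sketch assumes the usual definition, the sum of squared residuals divided by n - 2, and uses residuals from a made-up five-point example because the goldfish residuals are not listed in the text. The last lines check the one relationship that can be verified from the quoted figures: the square root of the residual variance equals the residual standard deviation.

```python
import math


def residual_variance(residuals):
    """Sum of squared residuals divided by n - 2.

    Two is subtracted because two coefficients (a and b) were
    estimated from the data.
    """
    n = len(residuals)
    return sum(e * e for e in residuals) / (n - 2)


# Residuals from a made-up five-point example, not the goldfish data.
example_residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
s2 = residual_variance(example_residuals)
print(f"example variance = {s2:.4f}, standard deviation = {math.sqrt(s2):.4f}")

# Check on the figures quoted in the text.
reported_variance = 56.1365714
print(f"square root of reported variance = {math.sqrt(reported_variance):.6f}")
```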