*This file is part of a program based on
the Bio 4835 Biostatistics class taught at Kean
University in *

Categorical data analysis deals with the statistical study of discrete data that can be organized into categories. Biologists are always concerned with categorization and classification of things. It is the basis of biological taxonomy.

In categorical data analysis, the data fall into discrete categories and are not continuous. An example would be the case of an outbreak of a disease. Among a sample of the population, some people might have the disease while others might not.

For analysis, categorical data are generally organized into a contingency table. The contingency table permits the calculation of proportions and other information that can be obtained from the data. The $\chi^2$ distribution is used in categorical data analysis because the data consist of categories and proportions.

**Structure of contingency tables**

The basic structure of a 2 by 2 contingency table consists of four cells arranged in two columns and two rows as shown below.

*The 2 X 2 contingency table.*

These cells are generally labeled with letters A through D. When doing calculations with these tables, we often add the columns and rows. The sums of the rows are

R1 = A + B

R2 = C + D

Similarly, the sums of the columns are:

C1 = A + C

C2 = B + D

The overall total, N, is found by adding R1 + R2 or C1 + C2 as shown below.

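|  | Column 1 | Column 2 | Total |
|---|---|---|---|
| Row 1 | A | B | R1 |
| Row 2 | C | D | R2 |
| Total | C1 | C2 | N |

*Labeled 2 X 2 contingency table.*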
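The row, column, and overall totals described above can be sketched in a few lines of Python. The function name `table_margins` and the counts are made up for illustration and do not come from any study in these notes.

```python
def table_margins(a, b, c, d):
    """Return the row totals, column totals, and overall total of a 2 x 2 table."""
    r1 = a + b   # total of row 1
    r2 = c + d   # total of row 2
    c1 = a + c   # total of column 1
    c2 = b + d   # total of column 2
    n = r1 + r2  # overall total; c1 + c2 gives the same number
    return {"R1": r1, "R2": r2, "C1": c1, "C2": c2, "N": n}


# Made-up counts, used only to illustrate the arithmetic
margins = table_margins(20, 30, 10, 40)
print(margins)
```

Adding the two row totals and adding the two column totals must give the same N, which is a quick check that the table was entered correctly.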

**Using a contingency table as a comparison table**

Contingency tables can be used for **comparison** of outcomes of laboratory tests. In medical technology, tests are routinely performed on patients. A patient either has a certain disease or does not, and the test gives either a positive or a negative result. Together these make four discrete outcomes that can be shown on a contingency table, such as the one below.

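|  | Disease present | Disease absent |
|---|---|---|
| Test positive | True positive | False positive |
| Test negative | False negative | True negative |

*Comparison of outcomes in laboratory tests.*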

The **false positive** result occurs when the test gives a positive result but the patient does not have the disease. The **false negative** result occurs when the test is negative but the patient really has the disease.
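As a rough illustration, the four outcomes can be tallied from a list of patient records. The records below are hypothetical, and the labels "true positive" and "true negative" are the usual names for the two correct outcomes rather than terms taken from these notes.

```python
from collections import Counter


def classify(has_disease, test_positive):
    """Name the outcome for one patient from disease status and test result."""
    if test_positive:
        return "true positive" if has_disease else "false positive"
    return "false negative" if has_disease else "true negative"


# Hypothetical patients as (has_disease, test_positive) pairs
patients = [(True, True), (True, False), (False, True), (False, False), (True, True)]

counts = Counter(classify(disease, test) for disease, test in patients)
print(counts)
```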

**Relative risk**

Relative risk is a ratio of two probabilities. The first is the probability, P(E), of an event occurring in the presence of the risk factor and the second is the probability of the same event, P(E’), occurring in the absence of the risk factor. Relative risk is often used in the reporting of information on the occurrence of disease.

When used in reporting disease, relative risk is the ratio of the occurrence of the disease among those exposed to the risk factor to the occurrence of the disease among those not exposed to the risk factor. To determine these probabilities, the 2 X 2 contingency table is used as shown below.

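|  | Disease (Column 1) | No disease (Column 2) | Total |
|---|---|---|---|
| Exposed to risk factor (Row 1) | A | B | R1 |
| Not exposed to risk factor (Row 2) | C | D | R2 |
| Total | C1 | C2 | N |

*Contingency table for relative risk study.*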

In this contingency table, Row 1 indicates those who were exposed to the risk factor and Row 2 indicates those who were not exposed to the risk factor. In each case, some got the disease (Column 1) while others did not get the disease (Column 2).

In Row 1, those exposed to the risk factor are considered a “success.” In this group there is an **absolute risk** of getting the disease, which is P(E) = A/R1. P(E) is the probability of the disease occurring in this exposed group and represents the sample proportion of those with the disease among those who were exposed to the risk factor.

Similarly, in Row 2, those not exposed to the risk factor are considered a “failure.” In this group, the absolute risk of getting the disease is P(E’) = C/R2. P(E’) is the probability of the disease occurring in this non-exposed group and represents the sample proportion of those with the disease among those who were not exposed to the risk factor. This is summarized by the following equations.

$$P(E) = \frac{A}{R1} \qquad P(E') = \frac{C}{R2}$$

The relative risk, RR, is the ratio of these two proportions.

$$RR = \frac{P(E)}{P(E')} = \frac{A/R1}{C/R2}$$

**Example: Outbreak of gastrointestinal disease**

The CDC reported an outbreak of gastrointestinal illness in an elementary school (*Morbidity and Mortality Weekly Report*, March 19, 1999/48(10); 210-3) that was attributed to eating burritos. There were 452 children examined. Among these children, 304 ate burritos, of whom 145 became ill, while 10 of the 148 who did not eat the burritos became ill. These data are presented in the table below.

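|  | Ill | Not ill | Total |
|---|---|---|---|
| Ate burritos | 145 | 159 | 304 |
| Did not eat burritos | 10 | 138 | 148 |
| Total | 155 | 297 | 452 |

*Contingency table for GI illness outbreak in *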

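Among those children who ate burritos, the absolute risk of getting GI illness can be calculated.

$$P(E) = \frac{A}{R1} = \frac{145}{304} = 0.4770$$

Similarly, among the children who did not eat burritos, the absolute risk of getting GI illness can be calculated.

$$P(E') = \frac{C}{R2} = \frac{10}{148} = 0.0676$$

From these probabilities, we may calculate the relative risk.

$$RR = \frac{P(E)}{P(E')} = \frac{145/304}{10/148} = 7.059210526$$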

Relative risk is generally reported with 1 decimal place. In this case our result would be RR = 7.1.
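The same arithmetic can be checked with a short Python sketch. The function name `relative_risk` is an illustrative choice; the counts are the ones given for the burrito outbreak (145 ill of 304 who ate burritos, 10 ill of 148 who did not).

```python
def relative_risk(a, b, c, d):
    """Return P(E), P(E') and RR for a 2 x 2 table with cells A, B, C, D."""
    r1 = a + b               # exposed group total
    r2 = c + d               # non-exposed group total
    p_exposed = a / r1       # absolute risk among the exposed, P(E)
    p_unexposed = c / r2     # absolute risk among the non-exposed, P(E')
    return p_exposed, p_unexposed, p_exposed / p_unexposed


p_e, p_not_e, rr = relative_risk(145, 159, 10, 138)
print(f"P(E) = {p_e:.4f}, P(E') = {p_not_e:.4f}, RR = {rr:.1f}")
# prints: P(E) = 0.4770, P(E') = 0.0676, RR = 7.1
```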

**Significance in relative risk calculations**

Significance in relative risk is found using the $\chi^2$ distribution. In general, the $\chi^2$ value is found by subtracting the expected value, E, from the observed value, O, squaring the result, and dividing by the expected value. The sum of these terms gives the $\chi^2$ value.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

In contingency table calculations, the values from the table are used to give a $\chi^2$ value according to the following formula.

$$\chi^2 = \frac{N(AD - BC)^2}{R1 \cdot R2 \cdot C1 \cdot C2}$$

The values of these variables are substituted from the contingency table made from the data obtained in the study. The result is a value of $\chi^2$ = 74.0447447, which is highly significant considering that the highest value of $\chi^2$ in the table with 1 *df* is 7.879.
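As a check on the value of 74.0447447, the sketch below computes $\chi^2$ for the burrito table in two ways: from observed and expected counts, and from the cell counts and totals directly. The function names are illustrative. No continuity correction is applied, an assumption that agrees with the value reported here.

```python
def chi_square_expected(a, b, c, d):
    """Chi-square as the sum of (observed - expected)^2 / expected over the four cells."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    observed = [a, b, c, d]
    # Expected count for each cell = (row total * column total) / N
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_shortcut(a, b, c, d):
    """Chi-square for a 2 x 2 table as N(AD - BC)^2 / (R1 * R2 * C1 * C2)."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)


chi_sq_sum = chi_square_expected(145, 159, 10, 138)
chi_sq_short = chi_square_shortcut(145, 159, 10, 138)
print(round(chi_sq_sum, 4), round(chi_sq_short, 4))
```

For a 2 X 2 table the two routes are algebraically the same, so both should agree to within rounding.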

The same result can be obtained from the TI-83 calculator, which uses a 2 X 2 matrix as shown below.

*Calculations of $\chi^2$ for 2 X 2 contingency table: A. Matrix setup. B. Calculation results.*

The calculator also gives the p value for the computation, which in this case is p = $7.63 \times 10^{-18}$ for these data, also indicating the highly significant nature of the answer.
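Without a calculator, the p value can be approximated in Python. For 1 *df*, the upper-tail probability of $\chi^2$ equals `erfc(sqrt(x / 2))`, a standard identity that follows from $\chi^2$ with 1 *df* being the square of a standard normal variable. The function name below is an illustrative choice.

```python
import math


def chi_square_p_value_1df(chi_sq):
    """Upper-tail probability P(X > chi_sq) for chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


p_value = chi_square_p_value_1df(74.0447447)
print(f"p = {p_value:.2e}")

# Sanity check against the table: 7.879 corresponds to an upper tail of about 0.005
p_critical = chi_square_p_value_1df(7.879)
print(f"{p_critical:.4f}")
```

This shortcut applies only to 1 *df*; tables with more degrees of freedom need the general $\chi^2$ distribution function.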

**Confidence interval for a relative risk calculation**

As noted earlier in this program of study, a confidence interval is composed of three parts. These are shown below.

$$\text{estimator} \pm (\text{reliability coefficient}) \times (\text{standard error})$$

In relative risk calculations, the value of RR is the estimator. The reliability coefficient is 1.96, which corresponds to a 95% confidence interval. Now, we must calculate the standard error of RR.

The sampling distribution of RR is not symmetric. Like the curve for $\chi^2$ with 1 *df*, which resembles an exponential decay curve, it is skewed, so the interval is not built directly on RR. A **logarithmic transformation** makes the distribution approximately normal. Recall that RR = 7.059210526. Logarithmic transformation of this value gives ln RR = 1.954333272. The standard error of ln RR is then calculated from the cell counts of the contingency table.

This calculation can be done by substituting in the values and performing the appropriate operations on the calculator.

Once the standard error for ln RR is found, the confidence interval for ln RR can be calculated.

$$\ln RR \pm 1.96 \times SE(\ln RR)$$

Finally, the antilog is taken to find the confidence interval for RR.

For this study, we would report the results as RR = 7.1 with a 95% CI of 3.6 – 13.9.
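The steps above can be sketched in Python. The standard error equation is not reproduced in these notes, so the sketch makes an assumption inferred from the numbers: the reported interval of 3.6 – 13.9 is reproduced when the standard error is taken as the square root of the sum of the reciprocals of the four cells. The more commonly quoted log-method standard error for a relative risk, $\sqrt{1/A - 1/R1 + 1/C - 1/R2}$, is computed alongside for comparison and gives a somewhat narrower interval. All function and variable names are illustrative.

```python
import math


def log_scale_ci(estimate, se_log, z=1.96):
    """Confidence interval built on the log scale, then transformed back by antilog."""
    center = math.log(estimate)
    return math.exp(center - z * se_log), math.exp(center + z * se_log)


a, b, c, d = 145, 159, 10, 138          # burrito outbreak table
r1, r2 = a + b, c + d
rr = (a / r1) / (c / r2)

# Assumption: standard error that reproduces the interval reported in these notes
se_from_cells = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_from_cells = log_scale_ci(rr, se_from_cells)

# Commonly quoted log-method standard error for a relative risk
se_log_rr = math.sqrt(1 / a - 1 / r1 + 1 / c - 1 / r2)
ci_log_rr = log_scale_ci(rr, se_log_rr)

print(f"RR = {rr:.1f}")
print(f"CI using cell reciprocals: {ci_from_cells[0]:.1f} to {ci_from_cells[1]:.1f}")
print(f"CI using log-method SE:    {ci_log_rr[0]:.1f} to {ci_log_rr[1]:.1f}")
```

Which standard error to report is a choice worth checking against the reference text for the course; the sketch only shows how each one feeds into the same three-part interval.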

**Odds Ratio**

In epidemiological case-control studies, the odds ratio, OR, is frequently calculated. In a case-control study, everyone involved in some way with the outbreak is included.

Recall that the absolute risk, which is really a proportion, can be calculated from the 2 X 2 contingency table. Also, the relative risk is the ratio of the absolute risk values between the exposed and non-exposed individuals. Now, the odds ratio relates the odds of getting the disease when exposed to the odds of getting the disease when not exposed. The general contingency table for a case-control study is shown below.

*General contingency table for
a case-control study.*

**How odds ratio is determined**

Theoretically, the **probability** of an event, E, is given by P(E), which is a proportion. The **odds** of the event are found by dividing the proportion of the event, P(E), by the proportion of the event not occurring, which is given as 1 – P(E).

$$\text{Odds}(E) = \frac{P(E)}{1 - P(E)}$$

Odds (E) would be applicable to a case-control study for those who got the disease when exposed to the risk factor.

Among those not exposed to the risk factor, E’, there is a proportion who got the disease, P(E’), and a proportion who did not get the disease, 1 – P(E’). The odds of getting the disease when not exposed to the risk factor are

$$\text{Odds}(E') = \frac{P(E')}{1 - P(E')}$$

The **odds ratio**, OR, is the ratio of these two odds.

$$OR = \frac{\text{Odds}(E)}{\text{Odds}(E')} = \frac{P(E) / (1 - P(E))}{P(E') / (1 - P(E'))}$$

The odds ratio gives an indication of the chances of getting the disease when exposed to the risk factor compared with the odds of getting the disease when not exposed to the risk factor.
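A minimal sketch of these definitions follows. The function names and the two probabilities are made up for illustration.

```python
def odds(p):
    """Convert a probability (proportion) into odds: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be at least 0 and less than 1")
    return p / (1 - p)


def odds_ratio_from_probabilities(p_exposed, p_unexposed):
    """Ratio of the odds among the exposed to the odds among the non-exposed."""
    return odds(p_exposed) / odds(p_unexposed)


# Made-up proportions: half of the exposed and one fifth of the non-exposed became ill
example_or = odds_ratio_from_probabilities(0.5, 0.2)
print(odds(0.5), odds(0.2), example_or)
```

A probability of 0.5 corresponds to odds of 1 (often read as "1 to 1"), and a probability of 0.2 to odds of 0.25, so the odds ratio in this made-up case is 4.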

**Relationship of odds ratio to the contingency table**

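The proportions used to find the odds ratio can be determined from the contingency table.

$$P(E) = \frac{A}{R1} \qquad 1 - P(E) = \frac{B}{R1} \qquad P(E') = \frac{C}{R2} \qquad 1 - P(E') = \frac{D}{R2}$$

The individual odds calculations can be estimated by substitution.

$$\text{Odds}(E) = \frac{A/R1}{B/R1} = \frac{A}{B} \qquad \text{Odds}(E') = \frac{C/R2}{D/R2} = \frac{C}{D}$$

Based on these estimations, the value of OR is found.

$$OR = \frac{A/B}{C/D} = \frac{AD}{BC}$$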

The value of OR, therefore, can be estimated by finding the ratio of the cross products of the values in the cells of the contingency table.
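The equivalence of the two routes can be checked numerically. The sketch below reuses the counts from the burrito table given earlier purely as convenient numbers; they are not the counts of the case-control study, and the function names are illustrative.

```python
import math


def odds_ratio_cross_product(a, b, c, d):
    """Odds ratio as the ratio of the cross products, AD / BC."""
    return (a * d) / (b * c)


def odds_ratio_from_proportions(a, b, c, d):
    """Odds ratio computed the long way, from P(E) and P(E')."""
    p_exposed = a / (a + b)
    p_unexposed = c / (c + d)
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_unexposed = p_unexposed / (1 - p_unexposed)
    return odds_exposed / odds_unexposed


or_short = odds_ratio_cross_product(145, 159, 10, 138)
or_long = odds_ratio_from_proportions(145, 159, 10, 138)
print(round(or_short, 2), round(or_long, 2))
print(math.isclose(or_short, or_long))
```

The row totals cancel in the long route, which is why the cross-product shortcut needs only the four cell counts.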

**Example: Outbreak of gastrointestinal disease—case-control study**

In
the same report cited above on gastrointestinal illness associated with eating
burritos, a case-control study was conducted in

*Contingency table for
case-control study.*

**Odds ratio calculations**

The odds ratio, OR, for the data in the figure above, is found using the ratio of the cross products as described above.

$$OR = \frac{AD}{BC} = 8.8$$

It is also possible to calculate a $\chi^2$ value for the data along with its p value, as well as a confidence interval.

The matrix and results of the calculation are given below.

*Calculation of $\chi^2$ test for case-control study: A. Matrix. B. Results.*

The results give a $\chi^2$ of 10.56, which is significant at the p = .0012 level. This value exceeds the critical value of $\chi^2$ for 1 *df*, which is 7.879 at the 99.5% level.

Construction of the confidence interval for OR is done in the same way as that for RR using logarithmically transformed data. The calculation for standard error after logarithmic transformation is

$$SE(\ln OR) = \sqrt{\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}}$$

The value of ln 8.8 = 2.174751721. The 95% confidence interval for the transformed relationship is

$$\ln OR \pm 1.96 \times SE(\ln OR)$$

Taking antilogs, we get the confidence interval.

Therefore, OR = 8.8 with a 95% CI of 2.14 – 36.3 in this study.
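The whole procedure for an odds ratio can be sketched as one function. The case-control counts are not reproduced in these notes, so the example call reuses the counts from the burrito table given earlier only to show the mechanics; its output is not the OR or interval of the case-control study. The function name is illustrative.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Return the odds ratio and its confidence interval from four cell counts."""
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR): square root of the sum of reciprocals of the cells
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)   # antilog of the lower log limit
    upper = math.exp(log_or + z * se_log_or)   # antilog of the upper log limit
    return odds_ratio, lower, upper


# Illustration only: counts from the earlier burrito table, not the case-control data
or_value, or_lower, or_upper = odds_ratio_with_ci(145, 159, 10, 138)
print(f"OR = {or_value:.1f}, 95% CI {or_lower:.1f} to {or_upper:.1f}")
```

Because the interval is symmetric on the log scale, the product of the two limits equals the square of the odds ratio, which is a handy check on hand calculations.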