Categorical Data Analysis

This file is part of a program based on the Bio 4835 Biostatistics class taught at Kean University in Union, New Jersey.  The course uses the following text:
Daniel, W. W. 1999.  Biostatistics: a foundation for analysis in the health sciences.  New York: John Wiley and Sons.  
The file follows this text very closely and readers are encouraged to consult the text for further information.

Introduction

 

            Categorical data analysis deals with the statistical study of discrete data that can be organized into categories.  Biologists are always concerned with categorization and classification of things.  It is the basis of biological taxonomy.

 

            In categorical data analysis, the data fall into discrete categories and are not continuous.  An example would be the case of an outbreak of a disease.  Among a sample of the population, some people might have the disease while others might not.

 

            In categorical data analysis, the data are generally organized into a contingency table.  The contingency table permits the calculation of proportions and other information that can be obtained from the data.  The χ² (chi-square) distribution is used in categorical data analysis because the data consist of categories and proportions.

 

 

Structure of contingency tables

 

            The basic structure of a 2 by 2 contingency table consists of four cells arranged in two columns and two rows as shown below.

 

                                                            Column 1     Column 2
                                                Row 1           A            B
                                                Row 2           C            D

                                                The 2 X 2 contingency table.

 

            These cells are generally labeled with letters A through D.  When doing calculations with these tables, we often add the columns and rows.  The sums of the rows are

                                                            R1 = A + B

                                                            R2 = C + D

 

Similarly, the sums of the columns are:

 

                                                            C1 = A + C

                                                            C2 = B + D

 

The overall total, N, is found by adding R1 + R2 or C1 + C2 as shown below.

 

                       

                                                            Column 1     Column 2     Total
                                                Row 1           A            B          R1
                                                Row 2           C            D          R2
                                                Total           C1           C2         N

                                                            Labeled 2 X 2 contingency table.
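
The row, column, and overall totals described above can be sketched in a few lines of Python.  The function name table_totals and the example counts are illustrative assumptions and are not taken from the course text.

```python
def table_totals(a, b, c, d):
    """Return the marginal totals of a 2 X 2 table.

    Cells A and B form row 1; cells C and D form row 2.
    """
    r1 = a + b          # row 1 total
    r2 = c + d          # row 2 total
    c1 = a + c          # column 1 total
    c2 = b + d          # column 2 total
    n = r1 + r2         # overall total
    # The overall total must be the same whether rows or columns are added.
    assert n == c1 + c2
    return {"R1": r1, "R2": r2, "C1": c1, "C2": c2, "N": n}


# Made-up counts, used only to show the arithmetic.
totals = table_totals(20, 30, 10, 40)
print(totals)
```

Adding the rows and adding the columns must give the same N, which is a quick check that a table has been entered correctly.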

 

Using a contingency table as a comparison table

 

            Contingency tables can be used for comparison of outcomes of laboratory tests.  In medical technology, tests are routinely performed on patients.  A patient may or may not have a certain disease, and the test may give a positive or a negative result.  Together these give four discrete outcomes that can be shown on a contingency table, such as the one below.

 

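                                                            Disease present      Disease absent
                                    Test positive           True positive        False positive
                                    Test negative           False negative       True negative

                                    Comparison of outcomes in laboratory tests.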

 

            The false positive result occurs when the test gives a positive result but the patient does not have the disease.  The false negative result is when the test is negative but the patient really has the disease.

 

Relative risk

 

            Relative risk is a ratio of two probabilities.  The first is the probability, P(E), of an event occurring in the presence of the risk factor and the second is the probability of the same event, P(E’), occurring in the absence of the risk factor.  Relative risk is often used in the reporting of information on the occurrence of disease.

 

            When used in reporting disease, relative risk is the ratio of the occurrence of the disease among those exposed to the risk factor and the occurrence of the disease among those not exposed to the risk factor.  In order to determine these probabilities, the 2 X 2 contingency table is used as shown below.

 

 

                                                Contingency table for relative risk study.

 

            In this contingency table, Row 1 indicates those who were exposed to the risk factor and Row 2 indicates those who were not exposed to the risk factor.  In each case, some got the disease (Column 1) while others did not get the disease (Column 2).

 

            In Row 1, those exposed to the risk factor are considered as a “success.”  In this group there is an absolute risk of getting the disease, which is P(E) = A/R1.  P(E) is the probability of the disease occurring in this exposed group and represents the sample proportion of those with the disease among those who were exposed to the risk factor.

 

            Similarly, in Row 2, those not exposed to the risk factor are considered as a “failure.”  In this group, the absolute risk of getting the disease is P(E’) = C/R2.  P(E’) is the probability of the disease occurring in this non-exposed group and represents the sample proportion of those with the disease among those who were not exposed to the risk factor.  This is summarized by the following equations.

                        P(E) = A/R1

                        P(E’) = C/R2

The relative risk, RR, is the ratio of these two proportions.

                        RR = P(E) / P(E’) = (A/R1) / (C/R2)


Example: Outbreak of gastrointestinal disease

 

            The CDC reported an outbreak of gastrointestinal (GI) illness in an elementary school in Georgia that was attributed to eating burritos (reference: Morbidity and Mortality Weekly Report, March 19, 1999/48(10); 210-3).  There were 452 children examined.  Of the 304 children who ate burritos, 145 became ill, while 10 of the 148 who did not eat burritos became ill.  These data are presented in the table below.

                                                                Ill       Not ill      Total
                                    Ate burritos                145         159         304
                                    Did not eat burritos         10         138         148
                                    Total                       155         297         452

                                    Contingency table for GI illness outbreak in Georgia.

 

Among those children who ate burritos, the absolute risk of getting GI illness can be calculated.

                        P(E) = A/R1 = 145/304 = 0.476973684

Similarly, among the children who did not eat burritos, the absolute risk of getting GI illness can be calculated.

                        P(E’) = C/R2 = 10/148 = 0.067567568

From these probabilities, we may calculate the relative risk.

                        RR = P(E) / P(E’) = 0.476973684 / 0.067567568 = 7.059210526


 

Relative risk is generally reported with 1 decimal place.  In this case our result would be RR = 7.1.
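
The relative risk calculation for the burrito outbreak can be reproduced with a short Python sketch.  The function name relative_risk is an assumption made for illustration.  The cell counts B = 159 and D = 138 are not stated directly in the text; they are obtained by subtraction (304 - 145 and 148 - 10).

```python
def relative_risk(a, b, c, d):
    """Return P(E), P(E') and RR for a 2 X 2 table.

    Row 1 (A, B) holds the exposed group, row 2 (C, D) the non-exposed group.
    Column 1 (A, C) holds those who got the disease.
    """
    p_exposed = a / (a + b)        # absolute risk among the exposed, A/R1
    p_not_exposed = c / (c + d)    # absolute risk among the non-exposed, C/R2
    return p_exposed, p_not_exposed, p_exposed / p_not_exposed


# Georgia outbreak: 145 of 304 burrito eaters ill, 10 of 148 non-eaters ill.
p_e, p_not_e, rr = relative_risk(145, 159, 10, 138)
print(f"P(E) = {p_e:.9f}")
print(f"P(E') = {p_not_e:.9f}")
print(f"RR = {rr:.1f}")
```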

 

 

Significance in relative risk calculations

 

            Significance in relative risk is found using the χ² distribution.  In general the χ² value is found by subtracting the expected value from the observed value, squaring the result, and dividing by the expected value.  The sum of these terms gives the χ² value.

                        χ² = Σ (O – E)² / E


            In contingency table calculations, the values from the table are used to give a χ² value according to the following formula.

                        χ² = N (AD – BC)² / (R1 × R2 × C1 × C2)


The values of these variables are substituted from the contingency table made from the data obtained in the study.  The result is χ² = 74.0447447, which is highly significant: it far exceeds 7.879, the largest critical value of χ² listed in the table for 1 df (the 99.5% level).
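
The two routes to χ² described above, the general sum over observed and expected values and the shortcut for a 2 X 2 table, can be checked against each other in Python.  This is a minimal sketch: the function names are assumptions, and no continuity correction is applied because the value reported in the text matches the uncorrected statistic.

```python
def chi_square_shortcut(a, b, c, d):
    """Chi-square for a 2 X 2 table from the cell counts and marginal totals."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)


def chi_square_from_expected(a, b, c, d):
    """Chi-square as the sum of (observed - expected)^2 / expected."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = sum(row_totals)
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Georgia outbreak counts (B and D obtained by subtraction from the text).
a, b, c, d = 145, 159, 10, 138
chi2_shortcut = chi_square_shortcut(a, b, c, d)
chi2_sum = chi_square_from_expected(a, b, c, d)
print(f"shortcut formula: {chi2_shortcut:.7f}")
print(f"sum over cells:   {chi2_sum:.7f}")
```

For a 2 X 2 table the two forms are algebraically equivalent, so any difference between the printed values is only floating-point rounding.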

 

            The same result can be obtained from the TI-83 calculator, which performs the χ² test on a 2 X 2 matrix of the observed counts, as shown below.

 

         

 

                                                A. Matrix setup.                            B. Calculation results.

 

                                                       Calculations of χ² for 2 X 2 contingency table.

 

The calculator also gives the p value for the computation, which in this case is p = 7.63 x 10^-18, also indicating the highly significant nature of the answer.
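
The p value can also be checked without a calculator.  The sketch below relies on a standard fact rather than anything stated in the course text: with 1 df, χ² is the square of a standard normal variable, so the upper-tail probability is erfc(sqrt(χ²/2)).  The function name is an assumption, and the shortcut is valid for 1 df only.

```python
import math


def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square value with 1 df.

    For 1 df, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


p_outbreak = chi_square_p_value_1df(74.0447447)   # value from the text
p_critical = chi_square_p_value_1df(7.879)        # tabled critical value
print(f"p for chi-square 74.0447447: {p_outbreak:.2e}")
print(f"p for chi-square 7.879:      {p_critical:.4f}")
```

The second line confirms that 7.879 cuts off an upper tail of about 0.005, which is why it appears as the 99.5% entry in the χ² table.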

 

Confidence interval for a relative risk calculation

 

            As noted earlier in this program of study, a confidence interval is composed of three parts.  These are shown below.

                        estimator ± (reliability coefficient) × (standard error)

In relative risk calculations, the value of RR is the estimator.  The reliability coefficient is 1.96, which corresponds to a 95% confidence interval.  Now, we must calculate the standard error of RR.

 

            The sampling distribution of RR is not symmetric; it is skewed to the right, so the interval cannot be built on RR directly.  A logarithmic transformation makes the distribution approximately normal.  Recall that RR = 7.059210526.  Logarithmic transformation of this value gives ln RR = 1.954333272.  The standard error for the transformed RR is given by the following equation.

 

 

This calculation can be done by substituting in the values and performing the appropriate operations on the calculator.

 

 

Once the standard error for ln RR is found, the confidence interval for ln RR can be calculated.

                        ln RR ± 1.96 × SE(ln RR)


Finally, the antilog is taken to find the confidence interval for RR.

                        e^(ln RR – 1.96 × SE(ln RR))  to  e^(ln RR + 1.96 × SE(ln RR))


For this study, we would report the results as RR = 7.1 with a 95% CI of 3.6 – 13.9.
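
The three steps (standard error, interval on the log scale, antilog) can be sketched in Python.  The standard error formula is not reproduced in the text above, so the form used here is an assumption: sqrt(1/A + 1/B + 1/C + 1/D) is chosen because it reproduces the reported interval of 3.6 to 13.9.  The expression more commonly given for ln RR, sqrt(1/A - 1/R1 + 1/C - 1/R2), is computed alongside for comparison.  All function and variable names are illustrative.

```python
import math


def log_scale_ci(estimate, se_log, z=1.96):
    """Confidence interval built on the log scale, then back-transformed."""
    log_est = math.log(estimate)
    lower = math.exp(log_est - z * se_log)
    upper = math.exp(log_est + z * se_log)
    return lower, upper


# Georgia outbreak counts (B and D obtained by subtraction from the text).
a, b, c, d = 145, 159, 10, 138
rr = (a / (a + b)) / (c / (c + d))

# Assumption: four-cell standard error, which matches the reported interval.
se_four_cells = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
# Form more commonly given for ln RR, shown for comparison.
se_common = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

ci_four_cells = log_scale_ci(rr, se_four_cells)
ci_common = log_scale_ci(rr, se_common)
print(f"RR = {rr:.1f}")
print(f"four-cell SE: {ci_four_cells[0]:.1f} to {ci_four_cells[1]:.1f}")
print(f"common SE:    {ci_common[0]:.1f} to {ci_common[1]:.1f}")
```

With these counts the four-cell form gives about 3.6 to 13.9 and the more common form gives a somewhat narrower interval of about 3.8 to 13.0.  Either way the interval lies well above 1, in agreement with the χ² result.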

 

Odds Ratio

 

            In epidemiological case-control studies, the odds ratio, OR, is frequently calculated.  In a case-control study, people who have the disease (cases) are compared with people who do not have it (controls) with respect to their exposure to the risk factor.

 

            Recall that the absolute risk, which is really a proportion, can be calculated from the 2 X 2 contingency table.  Also, the relative risk is the ratio of the absolute risk values between the exposed and non-exposed individuals.  Now, the odds ratio relates the odds of getting the disease when exposed to the odds of getting the disease when not exposed.  The general contingency table for a case-control study is shown below.

 

 

                                                            Cases        Controls      Total
                                    Exposed                   A             B            R1
                                    Not exposed               C             D            R2
                                    Total                     C1            C2           N

                                    General contingency table for a case-control study.

 

How odds ratio is determined

 

            Theoretically, the probability of an event, E, is given by P(E), which is a proportion.  The odds of the event are found by dividing the proportion of the event, P(E), by the proportion of the event not occurring, which is given as 1 – P(E).

                        Odds (E) = P(E) / (1 – P(E))

 

In a case-control study, Odds (E) applies to those exposed to the risk factor: it is the odds of getting the disease when exposed.

 

            Among those not exposed to the risk factor, E’, there is a proportion who got the disease, P(E’), and a proportion who did not get the disease, 1 – P(E’).  The odds of getting the disease when not exposed to the risk factor are

                        Odds (E’) = P(E’) / (1 – P(E’))

 

The odds ratio, OR, is the ratio of these two odds.

                        OR = Odds (E) / Odds (E’) = [P(E) / (1 – P(E))] / [P(E’) / (1 – P(E’))]


 

The odds ratio compares the odds of getting the disease when exposed to the risk factor with the odds of getting the disease when not exposed to the risk factor.

 

Relationship of odds ratio to the contingency table

 

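            The proportions used to find the odds ratio can be determined from the contingency table.

                        P(E) = A/R1          1 – P(E) = B/R1
                        P(E’) = C/R2         1 – P(E’) = D/R2

The individual odds calculations can be estimated by substitution.

                        Odds (E) = (A/R1) / (B/R1) = A/B
                        Odds (E’) = (C/R2) / (D/R2) = C/D

Based on these estimations, the value of OR is found.

                        OR = (A/B) / (C/D) = AD / BC
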

 

The value of OR, therefore, can be estimated by finding the ratio of the cross products of the values in the cells of the contingency table.
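
The equivalence between the ratio of the two odds and the ratio of the cross products can be verified numerically.  This is a minimal sketch: the function names and the cell counts are made up for illustration and do not come from any study cited here.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def odds_ratio_from_proportions(a, b, c, d):
    """OR computed as Odds(E) / Odds(E') from the row proportions."""
    p_e = a / (a + b)          # P(E), proportion with disease among exposed
    p_not_e = c / (c + d)      # P(E'), proportion with disease among non-exposed
    return odds(p_e) / odds(p_not_e)


def odds_ratio_cross_product(a, b, c, d):
    """OR computed as the ratio of cross products, AD / BC."""
    return (a * d) / (b * c)


# Made-up counts, used only to show that the two routes agree.
a, b, c, d = 12, 8, 5, 15
or_from_odds = odds_ratio_from_proportions(a, b, c, d)
or_cross = odds_ratio_cross_product(a, b, c, d)
print(f"ratio of odds:           {or_from_odds:.4f}")
print(f"ratio of cross products: {or_cross:.4f}")
print("agree:", math.isclose(or_from_odds, or_cross))
```

The row totals cancel in the division, which is why only the four cell counts are needed.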

 

 

Example: Outbreak of gastrointestinal disease—case-control study

 

            In the same report cited above on gastrointestinal illness associated with eating burritos, a case-control study was conducted in Florida.  Data are shown in the table below.

 

                       

 

                                                Contingency table for case-control study.

 

Odds ratio calculations

 

            The odds ratio, OR, for the data in the table above, is found using the ratio of the cross products as described above.

                        OR = AD / BC = 8.8


 

It is also possible to calculate a χ² value for the data, along with its p value and a confidence interval.

 

The matrix and results of the calculation are given below.

 

                                              

 

                                    A. Matrix                                                        B. Results

 

                                                Calculation of χ² test for case-control study.

 

The results give a χ² of 10.56, which is significant at the p = .0012 level.  This value exceeds the critical value of χ² for 1 df, which is 7.879 at the 99.5% level.

 

            Construction of the confidence interval for OR is done in the same way as that for RR, using logarithmically transformed data.  The calculation for standard error after logarithmic transformation is

                        SE(ln OR) = sqrt(1/A + 1/B + 1/C + 1/D)


 

The value of ln 8.8 = 2.174751721.  The 95% confidence interval for the transformed relationship is

                        ln OR ± 1.96 × SE(ln OR) = 2.174751721 ± 1.96 × SE(ln OR)


 

Taking antilogs, we get the confidence interval.

                        e^(ln OR – 1.96 × SE(ln OR))  to  e^(ln OR + 1.96 × SE(ln OR))  =  2.14 to 36.3


 

Therefore, OR = 8.8 with a 95% CI of 2.14 to 36.3 in this study.
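
The same log-scale procedure can be written as a small Python function.  The cell counts of the Florida study are not available in this text, so the counts below are made up purely to exercise the code and will not reproduce the figures above.  The function name is an assumption, and the standard error uses the usual form for ln OR, sqrt(1/A + 1/B + 1/C + 1/D).

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return OR and its confidence interval for a 2 X 2 table.

    The interval is built on ln OR and then back-transformed with exp().
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, NOT the Florida case-control data.
or_value, or_lower, or_upper = odds_ratio_ci(12, 8, 5, 15)
print(f"OR = {or_value:.1f}, 95% CI {or_lower:.2f} to {or_upper:.2f}")
```

Because the interval is symmetric on the log scale, it is not symmetric around OR itself; the upper limit lies farther from the estimate than the lower limit, as it does in the interval of 2.14 to 36.3 reported for this study.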