This file is part of a program based on the Bio 4835 Biostatistics class taught at
Daniel, W. W. 1999. Biostatistics: a foundation for analysis in the health sciences.
The file follows this text very closely and readers are encouraged to consult the text for further information.
A) Confidence interval for a population mean
Estimating the mean
Estimating the mean of a normally distributed population entails drawing a sample of size n and computing $\bar{x}$, which is used as a point estimate of $\mu$.
It is more meaningful to estimate $\mu$ by an interval that communicates information regarding the probable magnitude of $\mu$.
Sampling distributions and estimation
Interval estimates are based on sampling distributions. When the sample mean is being used as an estimator of a population mean, and the population is normally distributed, the sample mean will be normally distributed with mean, $\mu_{\bar{x}}$, equal to the population mean, $\mu$, and variance $\sigma^2/n$.
The 95% confidence interval
Approximately 95% of the values of $\bar{x}$ making up the distribution will lie within 2 standard deviations of the mean. The interval is marked by the two points, $\mu - 2\sigma_{\bar{x}}$ and $\mu + 2\sigma_{\bar{x}}$, so that 95% of the values are in the interval, $\mu \pm 2\sigma_{\bar{x}}$.
Since $\mu$ and $\mu_{\bar{x}}$ are unknown, the location of the distribution is uncertain. We can use $\bar{x}$ as a point estimate of $\mu$. In constructing intervals of $\bar{x} \pm 2\sigma_{\bar{x}}$, approximately 95% of these intervals would contain $\mu$.
Suppose a researcher, interested in obtaining an estimate of the average level of some enzyme in a certain human population, takes a sample of 10 individuals, determines the level of the enzyme in each, and computes a sample mean of $\bar{x} = 22$. Suppose further it is known that the variable of interest is approximately normally distributed with a variance of 45. We wish to estimate $\mu$.
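An approximate 95% confidence interval for $\mu$ is given by:
$$\bar{x} \pm 2\sigma_{\bar{x}} = 22 \pm 2\sqrt{45/10} = 22 \pm 2(2.1213) = (17.76,\ 26.24)$$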
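The arithmetic of the enzyme example can be checked with a short Python sketch. The variable names are illustrative and do not come from the text; the factor 2 is the approximate reliability coefficient used in this section.

```python
import math

# Values from the enzyme example
n = 10            # sample size
x_bar = 22.0      # sample mean
variance = 45.0   # known population variance

# Standard error of the mean: sqrt(sigma^2 / n)
std_error = math.sqrt(variance / n)

# Approximate 95% interval: x_bar plus or minus 2 standard errors
lower = x_bar - 2 * std_error
upper = x_bar + 2 * std_error

print(f"standard error = {std_error:.4f}")
print(f"approximate 95% CI = ({lower:.2f}, {upper:.2f})")
```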
Components of an interval estimate
This is the general form for an interval estimate.
estimator ± (reliability coefficient) (standard error)
The general form for an interval estimate consists of three components. These are known as the estimator, the reliability coefficient, and the standard error.
Estimator: The interval estimate of $\mu$ is centered on the point estimate of $\mu$. As noted in the table above, $\bar{x}$ is an unbiased point estimator for $\mu$.
Reliability coefficient: Approximately 95% of the values of the standard normal curve lie within 2 standard deviations of the mean. The z score in this case is called the reliability coefficient. We use the value of z that gives the correct interval size. The proper z score depends on the value of $\alpha$ being used. The three values of $\alpha$ most commonly used are .01, .05, and .10. Their corresponding z scores are 2.575, 1.96, and 1.645, respectively, as shown in the table below.
Table of reliability coefficients
Confidence level    $\alpha$    z
90%                 .10         1.645
95%                 .05         1.96
99%                 .01         2.575
Standard error: The standard error equals $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
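As a rough check on the reliability coefficients, the z value for any $\alpha$ can be computed instead of read from a table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function name `reliability_coefficient` is a hypothetical helper, not something from the text.

```python
from statistics import NormalDist

def reliability_coefficient(alpha):
    """Two-sided z value that leaves alpha/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# The three most commonly used values of alpha
for alpha in (0.10, 0.05, 0.01):
    z = reliability_coefficient(alpha)
    print(f"alpha = {alpha:.2f}  confidence = {100 * (1 - alpha):.0f}%  z = {z:.4f}")
```

The computed values agree with the usual table entries (1.645, 1.96, 2.575) to within table rounding.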
Interpretation of confidence intervals
The interval estimate for $\mu$ is expressed as:
$$\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$$
Assuming that we are using a value of $\alpha = .05$, we can say that, in repeated sampling, 95% of the intervals constructed this way will include $\mu$. This is based on the probability of occurrence of different values of $\bar{x}$.
The area of the curve of $\bar{x}$ that is outside the area of the interval is called $\alpha$, and the area inside the interval is called $1-\alpha$.
Interpretation of the interval
There are two ways in which interval estimates can be interpreted. These are known as the probabilistic interpretation and the practical interpretation.
The probabilistic interpretation results from repeated sampling. With repeated sampling from a normally distributed population with a known standard deviation, $100(1-\alpha)$ percent of all intervals of the form $\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$ will, in the long run, include the population mean, $\mu$. The quantity $1-\alpha$ is called the confidence coefficient or confidence level, and the interval, $\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$, is called the confidence interval for $\mu$.
Note that the percentage of intervals involved depends on the value of $\alpha$. With modern electronic devices such as the TI-83 calculator and Microsoft Excel, it is possible to use any value of $\alpha$. When statistics was developing during the 20th century, such devices were not generally available, so one had to use tables. These tables were very difficult to prepare, and so only a few values of $\alpha$ were supported. The most commonly used values of $\alpha$ are .01, .05, and .10. When these are used in the formula $100(1-\alpha)$, they yield percentages of 99%, 95%, and 90%, respectively. The most widely used value for a confidence level is 95%, which corresponds to $\alpha = .05$. Using this figure, the probabilistic interpretation says that in 100 samplings, about 95 of them should include $\mu$. For situations in which there is neither time nor ability to do 100 samplings, the practical interpretation is used.
The practical interpretation of the interval is used for a single sampling. When sampling is from a normally distributed population with known standard deviation, we are $100(1-\alpha)$ percent confident that the single computed interval, $\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$, contains the population mean, $\mu$.
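The probabilistic interpretation can be illustrated by simulation. The sketch below draws many samples from a normal population and counts how often the interval captures the mean. The "true" mean of 22 and variance of 45 are arbitrary values chosen for the simulation, and the function name `coverage` and the fixed seed are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def coverage(mu, sigma, n, alpha, trials, seed=0):
    """Fraction of z intervals (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)                    # fixed seed for repeatability
    z = NormalDist().inv_cdf(1 - alpha / 2)      # reliability coefficient
    half_width = z * sigma / math.sqrt(n)        # z times the standard error
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / trials

rate = coverage(mu=22.0, sigma=math.sqrt(45.0), n=10, alpha=0.05, trials=5000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

With $\alpha = .05$ the fraction should come out close to .95, though any single run will differ slightly because of sampling variation.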
Precision indicates how much the values deviate from their mean. Precision is found by multiplying the reliability factor by the standard error of the mean. This is also called the margin of error.
We wish to estimate the mean serum indirect bilirubin level of 4-day-old infants. The mean for a sample of 16 infants was found to be 5.98 mg/dl. Assuming bilirubin levels in 4-day-old infants are approximately normally distributed with a standard deviation of 3.5 mg/dl, find:
A) The 90% confidence interval for $\mu$
B) The 95% confidence interval for $\mu$
C) The 99% confidence interval for $\mu$
Given: $\bar{x} = 5.98$, $\sigma = 3.5$, $n = 16$
We start with the formula for an interval estimate, then substitute the values given in the problem.
$$\bar{x} \pm z_{(1-\alpha/2)}\frac{\sigma}{\sqrt{n}} = 5.98 \pm z_{(1-\alpha/2)}\frac{3.5}{\sqrt{16}} = 5.98 \pm z_{(1-\alpha/2)}(.875)$$
Then we need to determine the values of the reliability coefficient that will be used in solving the three parts of the problem. We consult the Table of Reliability Coefficients above. The correct value of the reliability coefficient is multiplied by the standard error (.875). The resulting value is subtracted from, then added to, the value of $\bar{x}$ to give the boundaries of the interval estimate.
A) 90% interval (z = 1.645)
5.98 ± 1.645 (.875)
Interpretation: We estimate the population mean to be 5.98. We are 90% confident that the true value of the mean lies between 4.5408 and 7.4192.
B) 95% interval (z = 1.96)
5.98 ± 1.96 (.875)
Interpretation: We estimate the population mean to be 5.98. We are 95% confident that the true value of the mean lies between 4.265 and 7.695.
C) 99% interval (z = 2.575)
5.98 ± 2.575 (.875)
Interpretation: We estimate the population mean to be 5.98. We are 99% confident that the true value of the mean lies between 3.7261 and 8.2339.
A higher confidence level gives a wider interval. There is less chance of missing the true mean, but the estimate is less precise.
Calculator answers are slightly more accurate than hand calculations because the calculator computes the reliability coefficient numerically instead of using a rounded table value (for example, 2.5758 rather than 2.575).
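The three bilirubin intervals can be reproduced with a short Python sketch that compares the rounded table values of z with values computed by `statistics.NormalDist`. The helper name `z_interval` and the dictionary `table_z` are illustrative assumptions, not part of the original problem.

```python
import math
from statistics import NormalDist

# Values given in the bilirubin problem
x_bar, sigma, n = 5.98, 3.5, 16
std_error = sigma / math.sqrt(n)   # 3.5 / 4 = 0.875

def z_interval(x_bar, std_error, z):
    """Return (lower, upper) limits of x_bar plus or minus z standard errors."""
    margin = z * std_error
    return x_bar - margin, x_bar + margin

# Rounded reliability coefficients as read from a table
table_z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.575}

for level, z_table in table_z.items():
    z_exact = NormalDist().inv_cdf(1 - (1 - level) / 2)   # unrounded z
    lo_t, hi_t = z_interval(x_bar, std_error, z_table)
    lo_e, hi_e = z_interval(x_bar, std_error, z_exact)
    print(f"{level:.0%}: table z -> ({lo_t:.4f}, {hi_t:.4f})   "
          f"computed z -> ({lo_e:.4f}, {hi_e:.4f})")
```

The two columns differ only in the third or fourth decimal place, which is the size of the rounding in the table values.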
The t distribution
In most real-life situations the variance of the population is unknown. We know that the z score, $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, is normally distributed if the population is normally distributed and is approximately normally distributed when the sample is large. However, it cannot be used because $\sigma$ is unknown.
Estimation of the standard deviation
The sample standard deviation, $s = \sqrt{\sum (x_i - \bar{x})^2 / (n-1)}$, can be used to replace $\sigma$. If $n \geq 30$, then s is a good approximation of $\sigma$. An alternate procedure is used when the samples are small. It is based on Student's t distribution.
Student's t distribution
Student's t distribution is used as an alternative for z with small samples. It uses the following formula:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Properties of the t distribution
1. Mean = 0
2. It is symmetrical about the mean.
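3. Variance is greater than 1 but approaches 1 as the sample gets large. For df > 2, the variance = df/(df-2) or, since df = n-1, (n-1)/(n-3).
4. The range is $-\infty$ to $+\infty$.
5. t is really a family of distributions because there is a different distribution for each value of n-1, the divisor used in computing $s^2$.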
6. Compared with the normal distribution, t is less peaked and has higher tails.
7. t distribution approaches the normal distribution as n-1 approaches infinity.
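Property 3 can be made concrete with a few lines of Python. This is only a sketch of the formula df/(df-2); the function name `t_variance` is a hypothetical helper chosen for illustration.

```python
def t_variance(df):
    """Variance of Student's t distribution, df/(df-2), valid only for df > 2."""
    if df <= 2:
        raise ValueError("the formula df/(df-2) requires df > 2")
    return df / (df - 2)

# The variance shrinks toward 1 (the standard normal variance) as df grows
for df in (3, 5, 10, 30, 100, 1000):
    print(f"df = {df:4d}  variance = {t_variance(df):.4f}")
```

The variance starts at 3 for df = 3 and falls toward 1, which is consistent with property 7: the t distribution approaches the normal distribution as n-1 grows.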
Confidence interval for a mean using t
When sampling is from a normal distribution whose standard deviation, $\sigma$, is unknown, the $100(1-\alpha)$ percent confidence interval for the population mean, $\mu$, is given by:
$$\bar{x} \pm t_{(1-\alpha/2)}\frac{s}{\sqrt{n}}$$
Deciding between z and t
When constructing a confidence interval for a population mean, we must decide whether to use z or t. Which one to use depends on the size of the sample, whether the population is normally distributed, and whether the population variance is known. There are various flowcharts and decision keys that can be used to help decide. Mine appears below.
Key for deciding between z and t in confidence interval construction
1. Population normally distributed................2
Not as above (population not normally distributed)....5
2. Sample size is large (30 or higher)............3
Sample size is small (less than 30)............4
3. Population variance is known.............use z
Population variance not known.... use t (or z)
4. Population variance is known.............use z
Population variance is not known.......use t
5. Sample size is large..................................6
Sample size is small..................................7
6. Population variance is known.............use z
Population variance not known
(central limit theorem applies)............use z
7. Must use a non-parametric method
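The key above can be written as a small function. This is a sketch that follows the seven couplets as printed, treating "large" as 30 or higher; the function name `choose_statistic` and its argument names are hypothetical.

```python
def choose_statistic(population_normal, n, variance_known):
    """Follow the decision key: return which statistic to use for the interval."""
    large = n >= 30                      # couplets 2 and 5: 30 or higher is "large"
    if population_normal:
        if variance_known:
            return "z"                   # couplets 3 and 4, variance known
        return "t (or z)" if large else "t"
    if large:
        return "z"                       # couplet 6: central limit theorem applies
    return "non-parametric method"       # couplet 7

# Example: normal population, 10 subjects, variance estimated from the sample
print(choose_statistic(population_normal=True, n=10, variance_known=False))
```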
In a study of preeclampsia, Kaminski and Rechberger found the mean systolic blood pressure of 10 healthy, nonpregnant women to be 119 with a standard deviation of 2.1.
(Preeclampsia: Development of hypertension, albuminuria, or edema between the 20th week of pregnancy and the first week postpartum.
Eclampsia: Coma and/or convulsive seizures in the same time period, without other etiology.)
a. What is the estimated standard error of the mean?
b. Construct the 99% confidence interval for the mean of the population from which the 10 subjects may be presumed to be a random sample.
c. What is the precision of the estimate?
d. What assumptions are necessary for the validity of the confidence interval you constructed?
$\bar{x} = 119$
n = 10
s = 2.1
(2) Sketch of t distribution
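Estimated standard error: $s/\sqrt{n} = 2.1/\sqrt{10} = .66407...$
99% confidence interval
(The correct value of t for a 99% confidence interval with 9 degrees of freedom is 3.2498)
119 ± 3.2498 (.66407...) = 119 ± 2.1581, or 116.84 to 121.16
Precision = 3.2498 (.66407...) = 2.1581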
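The blood pressure example can be checked with the following Python sketch. The t value of 3.2498 is taken directly from the solution above rather than computed, since the standard library has no t distribution; the variable names are illustrative.

```python
import math

# Values given in the problem
x_bar, s, n = 119.0, 2.1, 10
t_crit = 3.2498   # t for 99% confidence with n - 1 = 9 df, as quoted above

std_error = s / math.sqrt(n)        # part a: estimated standard error
precision = t_crit * std_error      # part c: precision (margin of error)
lower = x_bar - precision           # part b: lower confidence limit
upper = x_bar + precision           # part b: upper confidence limit

print(f"standard error = {std_error:.5f}")
print(f"precision      = {precision:.4f}")
print(f"99% CI         = ({lower:.2f}, {upper:.2f})")
```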