Using National Data to Develop Synthetic Estimates

 

SELECTION OF METHODS

National surveys use sampling methods that provide reliable and valid estimates for the United States as a whole and for very large subregions (e.g., West, South). Sample size and sampling design prevent the use of these data to provide direct estimates for smaller geographic areas. To make state and local estimates of measures from national surveys, indirect or synthetic estimates may be used. Creating a synthetic estimate involves generating summary proportions from the national data and applying them to state or local census data.

Indirect estimation involves the use of values of a variable of interest from a geographic area and/or time period other than the geographic area and time period of the estimate being produced.

Before this technique is discussed further, some other considerations in using national survey data are described.

 

National Survey Data are Weighted

In a national sample survey, such as the National Health Interview Survey (NHIS), each individual is assigned a weight based on the complex sampling design. Some groups are oversampled. Any analysis of weighted data must account for each individual's weight. After weighting is accounted for, the final estimate represents the total number of individuals for the nation. For example, a sample of 16,000 persons may represent, in the analysis, 24 million persons. Information about how the data were weighted is usually available in the documentation that accompanies the file. In the NHIS data files, each record includes a variable that gives that individual's total weight in the survey.
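As a minimal sketch of what accounting for each individual's weight means in practice, the Python example below computes a weighted total and a weighted proportion from a handful of records. The records, weights, and function names are hypothetical and are not taken from the NHIS; an actual analysis would read the weight variable named in the survey documentation.

```python
# Hypothetical survey records: (person weight, has activity limitation).
# Each weight is the number of people in the population that the
# sampled person represents. These values are illustrative only.
records = [
    (1500.0, False),
    (1200.0, True),
    (1800.0, False),
    (500.0, False),
]


def weighted_total(records):
    """Estimated population size: the sum of all person weights."""
    return sum(weight for weight, _ in records)


def weighted_proportion(records):
    """Share of the weighted population that has the characteristic."""
    limited = sum(weight for weight, flag in records if flag)
    return limited / weighted_total(records)


def unweighted_proportion(records):
    """Share of sampled persons with the characteristic, ignoring weights."""
    return sum(1 for _, flag in records if flag) / len(records)


print(weighted_total(records))
print(weighted_proportion(records))
print(unweighted_proportion(records))
```

In this sketch the weighted and unweighted proportions differ because the sampled persons do not all represent the same number of people. The sketch covers point estimates only; standard errors for a complex sample design need the design information described in the survey documentation.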

Using Published Tables to Obtain Indirect Estimates

In lieu of using a data file to develop estimates, one can use published data tables. Advantages of using published tables are that the estimates are already weighted and that no data processing skills are required. The primary disadvantage to this method is that you must use the estimates and the stratification groups provided by the authors.

Creating Estimates from a Data File Directly

Using the data file itself offers more flexibility to create a greater number of estimates for a variety of population groups. The majority of national survey data are now available on CD and are packaged with software that allows some manipulation of the data file.

The following example shows how a synthetic estimate can be calculated for the number of children ages 17 years and under with activity limitation in State X.  The following formula shows the weighted percent of children with activity limitation to be 6.8 percent (taken from the 1994 NHIS Disability Supplement) using the data from the table below.

 

4,743,842 / 70,023,660 = 0.068 (6.8 percent)

Characteristic            All ages 17 years and under

All Children              70,023,660

Activity Status
  Not Limited             65,279,818 (93.2%)
  Limited                 4,743,842 (6.8%)

 

If the number of children ages 0-17 in State X is 800,000, then the estimated number of children with activity limitation in State X is 54,400. This was obtained using the following formula:

(proportion of children with activity limitations in the national survey) x (number of children in State X) = estimate of the number of children with activity limitations in State X

or, using actual numbers:

0.068 x 800,000 = 54,400
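The arithmetic in this example can be sketched in a few lines of Python. The function name and its parameters are illustrative assumptions; the national counts come from the table above and the State X population of 800,000 from the text.

```python
def synthetic_estimate(national_with_condition, national_total, local_population):
    """Apply a national proportion to a local population count.

    Returns the national proportion and the resulting local estimate.
    """
    proportion = national_with_condition / national_total
    return proportion, proportion * local_population


# Weighted national counts from the table (1994 NHIS Disability Supplement).
children_limited = 4_743_842
children_total = 70_023_660

# Number of children ages 0-17 in State X, as given in the text.
state_x_children = 800_000

proportion, estimate = synthetic_estimate(
    children_limited, children_total, state_x_children
)

# The text rounds the proportion to 6.8 percent before multiplying.
rounded_estimate = round(proportion, 3) * state_x_children

print(f"national proportion: {proportion:.4f}")
print(f"estimate with unrounded proportion: {estimate:,.0f}")
print(f"estimate with proportion rounded to 0.068: {rounded_estimate:,.0f}")
```

Using the proportion rounded to 0.068 gives the 54,400 figure. Carrying the unrounded proportion through gives a somewhat smaller estimate, a few hundred lower, so it is worth stating which version is being reported.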

 

In many cases these indirect estimates are better than direct estimates collected locally, because many local data collection efforts lack the sample size needed to generate stable estimates for many populations of interest.

 

Potential Biases Associated with Creating Indirect Estimates

Using data that were not collected directly from the population of interest can introduce bias into the estimates. Biases may differ depending on the application because the indirect estimate will be a better representation of reality in some population domains than in others. To use national data to estimate target populations in ALL areas, the data must be stratified on population characteristics related to the indicator being estimated (e.g., age, sex,