University of Illinois at Chicago, School of Public Health
Environmental and Occupational Health Sciences Division

Introduction to Environmental Statistics

Module 3: Quality Assurance Quality Control

DR. PETER SCHEFF: Hello and welcome to our third in the series of lectures on environmental statistics and air quality data evaluation. Today's lecture will be primarily on issues around quality assurance and quality control. We're going to demonstrate a number of tools that are helpful for looking at data and for making decisions. There are a number of data sets I'll be referring to throughout this lecture. These are all on-line on the website, so I encourage you to download the data sets, take a look at our calculations, and probe a little bit and of course as always send us your questions and comments. So the objectives today are to discuss: What are data quality objectives and what is the process, the formal process for data quality objectives, and to look at ambient air quality surveillance program objectives specifically for data quality. My final example today is a detailed evaluation of the PM 2.5 program. And finally we'll show a number of large and small examples throughout the discussion.

Now, what are the steps in the data quality assessment process? There are a number of key steps that you want to go through quite formally. The first is to review the objectives and the sampling design. You want to specify before you begin what it is you want to try to achieve in terms of data quality. Part of that is to define what level of confidence and the parameters to be measured that you need to have for making decisions: What is your goal for reproducibility and accuracy? These are specified parameters that you want to measure and demonstrate through your data quality objective program that you've met. The third step is to conduct a preliminary review, look at the measurements, look at them using graphs and summary statistics. And we'll look at a number of graphical techniques in this lecture, but I just want to point out that we'll come back to graphs in future lectures in a lot more detail. But summarizing data by looking at it graphically is very important.

The next step is to select the test you're going to use to evaluate the data. If you want to decide, for example, whether or not your concentration is above a threshold, you will probably use a T test. You have to verify the assumptions of the test: Are you violating any underlying assumptions in a way that will make the conclusion drawn from the testing invalid? And finally you draw conclusions. I will show you how to do this in a couple of cases and show you what some of the pitfalls are. Now, I want to make a few comments to sort of set the stage here. One is this process never absolutely proves anything. We don't prove that a concentration is above a threshold or below a threshold. We support a conclusion or reject a conclusion that we can draw with some decision error potential. We never know what the underlying true value is. And we talked a lot about this in the last lecture.

One reason why we don't know or can never know what the true mean value is, for example, is there's measurement error. Another is that we don't have enough samples. Maybe the monitoring location is not appropriate. Maybe there's bias in the measurement device. There are many reasons why all we can do is estimate the true underlying value. That means when we go to make a decision we have the chance of coming to a wrong conclusion. So the data quality assessment only provides an estimate of the true value, and you can always make a wrong decision. So you need to be careful. Now, we have a fairly formal process we go through to do this. And you begin the process by stating a hypothesis, the null condition. It's a baseline condition that's presumed to be true in the absence of strong evidence to the contrary. So we start by saying we're in compliance, for example. And then we're looking for evidence that strongly suggests that we're not in compliance.

If there's strong evidence that the null statement is not true, then we accept the alternative hypothesis, which could be that you're not in compliance. So the hypothesis process is a process that requires a numerical value with which the measured parameter will be compared. This numerical value can be the concentration measured in another location, or the concentration measured last year, or a threshold value. So for a risk-based threshold, for example, we would use some value that scientists have agreed is the point that you want to be below. So you hypothesize you're either above or below the standard. Then you look for evidence to disprove that hypothesis.

Most national ambient air quality standards compliance determinations are not based on this process. They're based on just comparison to a number. But just recognize that these national ambient air quality standards do have a statistical basis to them. For example, they're based on averages over long periods of time, or they're based on percentiles. But we tend to not use this formal process for the national ambient air quality standards but we do use it quite routinely in risk-based programs.

Now, the parameter that we're looking at can be a number of things. Typically it's a mean, but it doesn't have to be a mean. It can be a percentile or a median. So you might want to use the 99th percentile. In the last lecture, we were looking at detection limits. Detection limits were probabilistically defined numbers that said when we felt we had sufficient confidence to call a measurement valid. We use the 99% confidence in those detection limit calculations. That's a very conservative criterion. Most statisticians and data analysts pretty much accept 95% as the decision error level. So from the tests and for the examples I'll be showing today, we'll use a confidence of 95%. That is, you have to be 95% certain that your null statement is not valid before you accept the alternative. But this is a value that can be varied. And there's nothing absolute about 95%. It's just one threshold that is commonly used. Now in this process there are two kinds of errors that you can possibly make. Type I and Type II error.

Type I error is when you reject the null hypothesis when in fact it's true. As a result, you come to the conclusion that the alternative is true when in fact it's not. Type II error is accepting the null hypothesis when it is false. You conclude that your null statement is true but in fact it's not. These two errors are more or less conservative, depending on how you state the null hypothesis. I'll show you by example what I mean. Here's an example that EPA has used in one of their documents on community risk. Scientists have decided that, for example, a threshold for chromium in a community should be less than one microgram per cubic meter. This is a risk-based number. It's based on a probabilistic risk assessment. Now there are two possible hypotheses you can state in evaluating whether or not concentrations measured in this community are acceptable or not. Hypothesis one is that the chrome concentration is greater than the threshold. So you can state as a null hypothesis that the underlying mean is greater than one. The alternative hypothesis is that the underlying mean is less than one.

The other way of doing this would be to state up front that the chrome concentration in the community is less than one. So your null hypothesis is that the underlying mean is less than or equal to one. The alternative hypothesis is, the mean is greater than one. These are not equivalent statements. To conclude that statement one is not true you must have a mean concentration quite a bit less than one. To conclude that statement two is not true, you must have the mean concentration quite a bit greater than one. So there is an area in-between these decision thresholds where it's not so clear what's going on. This is sometimes called the gray zone. And it does compromise your ability to make an informed decision.

So what are our two possible decision errors? If we state a null hypothesis that the concentration is greater than one, one decision error is to decide that the air levels exceed the threshold when they're actually below it. If we state the other null hypothesis, that the concentration is below one, then we could make the decision error of deciding that the level is below the threshold when it is actually above. Now, these are not equivalent errors. The risk of deciding that the air is below the threshold when actually it's above is a more severe risk, because the potential effect on human health is important. We want to protect human health. We don't want to think that it's less than one when it actually is greater than one. So it turns out that null hypothesis one is more conservative.

In the language of statistics, we're talking about a one-tail test. We're only concerned about whether or not the concentration is below the threshold, not how high it is above the threshold. So when we make these comparisons, we're going to use, as our reference point, a one-tailed probability test. And I'll demonstrate what that is with some examples. So here are our two possible statements: Null hypothesis one is that the underlying mean is greater than the threshold concentration and null hypothesis two that the underlying mean is less than the threshold concentration. C now is a regulatory threshold level. Remember that the first null hypothesis is a bit more conservative. To do this test, we're going to use a T test, and a T statistic is shown in this equation. And given a random sample of size n, the one tail T statistic can be used to test a hypothesis involving the mean of the population from which the sample was selected.
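The slide's equation is not reproduced in this transcript. Written out from the verbal definition given later in the lecture (sample mean minus the regulatory threshold, divided by the standard error), it would read roughly as follows. The symbols for the sample mean, the sample standard deviation s and the sample size n are notation chosen here, not necessarily the slide's.

```latex
t = \frac{\bar{x} - C}{s / \sqrt{n}}, \qquad \text{degrees of freedom} = n - 1
```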

Here we're looking at the difference between the sample mean and the threshold value, and we're looking for how far that difference is from zero. If that difference is statistically significantly different from zero, then we would reject our null hypothesis and accept the alternative. Now, we're not concerned if X is very much less than C. Remember, we're only concerned if X is much greater than C. So we're going to look at this as a one-tailed probability. So we compute t. And we compare t to some reference value of t. We call this the critical t, and we look this up with n minus one degrees of freedom, and we accept a number of assumptions that are built into this. One is that we have independent random samples and that the distribution of the sample mean is approximately normally distributed. Probably, not a bad assumption. We're also working with a test that's not particularly robust with outliers because the mean and the standard deviation are influenced by extreme values. If you have a number of data points that are extremely different from all the other ones, you have to be careful with the T statistic.

It is also not particularly useful in situations where you have a lot of data below the detection limit. So we might want to use a non-parametric test of proportions if we have a number of outliers or below detection limit data.

I also want to talk a little bit about how Excel tabulates the T distribution. As much as possible in this course I'm going to rely on Excel for the calculations, mostly because it's just an easy thing to use and it's generally available. But Excel has a T function which operates in a two-tailed concept. That is, if you specify a probability, Excel is going to put one-half of the probability in the left-hand tail and one-half of the probability in the right-hand tail, much like the picture I showed you in the last lecture. Now, this is in contrast to the standard normal function, which is a cumulative function. So in Excel, if you look up a standard normal probability, or standard normal Z score, it will give you the Z value for minus infinity up to that probability. But for the T statistic, you specify the total error and Excel splits that error between the two tails.
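A minimal sketch of the two conventions just described, outside Excel. Python's standard library has no T distribution, so this uses the standard normal as a stand-in, which is reasonable only for very large sample sizes. The function names are made up for this illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_cumulative(p):
    """Z value with cumulative probability p to its left (one-tailed lookup)."""
    return z.inv_cdf(p)


def z_two_tailed(p_total):
    """Positive Z value that leaves p_total split equally between both tails."""
    return z.inv_cdf(1 - p_total / 2)


one_tail_5pct = z_cumulative(0.95)   # 5% in the right-hand tail only
two_tail_10pct = z_two_tailed(0.10)  # 5% in each tail, 10% in total
two_tail_5pct = z_two_tailed(0.05)   # 2.5% in each tail, 5% in total

print(round(one_tail_5pct, 4), round(two_tail_10pct, 4), round(two_tail_5pct, 4))
```

The first two lookups give the same value, which is why a one-tailed 5% test needs a probability of 0.1 when the lookup function is two-tailed.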

Let me show you a couple examples. I just created this table using the T inverse function and normal standard inverse function. So, for example, if you ask Excel for a 5% confidence probability or 95% confidence with 20 degrees of freedom, you'll come back with a value of T of 2.08. Now, the T distribution becomes the normal distribution for very large sample sizes. So I computed T values with a sample size of a thousand because these Ts will pretty much be the same as the Z statistic or the standard normal statistic. So shown in this table are a number of T values for different probabilities and number of Z values from the normal distribution for different probabilities. So you can see the difference in the way these two functions work. If you look at the Z value, the cumulative probability up to 95%, the P value, P 0.95, cumulative probability of Z is 1.6449. If you look on the T side of the statistic, that 5% area in the tail of 1.64 is achieved when you specify the probability of 0.1. 0.1 says 0.05 in the right-hand tail, and 0.05 in the left-hand tail. Remember that the T is calculated as a two-tailed probability.

The Z is the cumulative probability or a one-tailed probability. That's just the way Excel works. Unfortunately, if you open up your statistics book, it could be either way, as cumulative probability or two-tailed probability. You just have to look and see and be a little bit careful. So let's do some examples. One data set that I'll be referring to extensively in this lecture is the Northbrook data set. This is a series of PM 2.5 measurements collected in Northbrook, Illinois. We'll use the calendar year 2003 with 118 measurements. This is an FRM monitor. We'll state the hypothesis that the mean concentration at Northbrook is greater than C, the standard of 15 micrograms per cubic meter. The alternative hypothesis in this case is that we're below the standard, C being the regulatory limit. We calculate T as the difference between the estimated parameter, the mean X, and the regulatory threshold C, divided by the standard error, which is the standard deviation of the sample divided by the square root of n. In this case the T statistic is minus 4.31.

Now we want a critical value with 5% error in one tail, with 117 degrees of freedom, n minus one. So we look up our critical T of 0.1, 117, and it comes back as 1.66. So the absolute value of T, 4.3, is much greater than our critical T of 1.66. Pretty strong evidence that the null hypothesis is not true. So we accept the alternative hypothesis that the concentration at Northbrook is below the standard. That seemed to work quite nicely. It doesn't always work so clearly.
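As a rough check, the Northbrook test can be re-run from the rounded summary statistics quoted in this lecture (mean 12.1, standard deviation 7.27, n of 118). Because the mean is rounded, the result differs slightly from the minus 4.31 obtained from the full data set. The critical value of 1.66 is taken from the lecture rather than computed, and the function name is made up for this sketch.

```python
import math


def one_sample_t(mean, sd, n, threshold):
    """T statistic for comparing a sample mean with a fixed threshold."""
    std_error = sd / math.sqrt(n)  # standard error of the mean
    return (mean - threshold) / std_error


# Rounded Northbrook summary statistics quoted in the lecture
t_stat = one_sample_t(mean=12.1, sd=7.27, n=118, threshold=15.0)

# One-tailed 5% critical value with 117 degrees of freedom, as quoted in the lecture
t_critical = 1.66

# The null hypothesis is "mean greater than the standard", so only a strongly
# negative T statistic counts as evidence against it.
reject_null = t_stat < -t_critical

print(round(t_stat, 2), reject_null)
```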

Let's take another example. This is the Springfield data set on the web, another data set I'll refer to in this lecture pretty extensively. Now at Springfield we had 105 measurements over the year, and the average of these 105 measurements was 15.6. Pretty convincing evidence that we're not below the standard. In fact, the T statistic is 0.78 in this case, which is well short of the critical value, so the difference from the standard is not statistically significant. So we would have to accept a null hypothesis that the mean concentration is above the standard. Now, here's where the decision-making problem comes in. Let's say we stated the alternate null hypothesis, that the regulatory threshold was less than the measured concentration. It sure looks like it.

But there's no compelling evidence to reject that null hypothesis either. So whether we state that we're in compliance with the standard or out of compliance with the standard, we would come to the same conclusion to accept the null hypothesis. So we're kind of in this gray zone where we can't use statistics to help us make a decision. This happens quite frequently. In addition to comparing a single sample mean to a regulatory threshold, another situation you can be in is comparing two measured or estimated means. This could be population one to population two, city one to city two, year one to year two, where you're comparing the sample mean at one location with the sample mean from another. You're looking at whether or not the difference between these two means is different from zero. You would still use the T distribution here but now use a pooled estimate of the standard error.

And you would probably use a two-tailed test because unless you a priori know that one sample mean is greater than another, you have to look at this as a two-tailed probability. Now, there are a number of assumptions going on here. We're assuming a random sample X of size n1 drawn from population one, and a random sample Y of size n2 drawn from population two. And we're assuming that the two variances are approximately equal and so we can come up with a valid estimate of the pooled variance. So you always have to ask yourself these questions when you're doing these kinds of tests: Are the assumptions reasonable? Are the errors truly normally distributed or not? Are the errors uncorrelated? If the errors are correlated, this could lead to problems in the T test. And do the errors have constant variability? That is, if you look at the distribution of errors, are they not a function of concentration?
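A minimal sketch of the pooled two-sample T statistic under the assumptions listed above (independent random samples, approximately equal variances). The two lists of concentrations are invented for illustration and are not from any of the course data sets; the function name is also made up.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample T statistic using a pooled variance estimate.

    Returns the T statistic and its degrees of freedom (n1 + n2 - 2).
    """
    n1, n2 = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / std_error, n1 + n2 - 2


# Hypothetical concentrations at two sites, in micrograms per cubic meter
site_a = [10.2, 12.5, 9.8, 14.1, 11.0, 13.3]
site_b = [11.9, 15.2, 12.4, 16.0, 13.1, 14.8]

t_stat, dof = pooled_t(site_a, site_b)
print(round(t_stat, 2), dof)
```

The absolute value of the statistic would then be compared with a two-tailed critical T for the returned degrees of freedom.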

We'll look at some of these issues in this and subsequent lectures, but just as a warning, there are deeper issues here that you may need to call a consultant in to deal with. So where do we begin? We begin with our preliminary data review. And the preliminary data review is just basically looking at your data to get some basic quantitative information from it. There are a number of things that I'll show you to look for. Looking at important numerical quantities is one. Looking at simple graphs to understand what the data looks like is another. Then there's statistical testing, which we've looked at a little bit, and looking at the assumptions underneath the test. Other issues that we'll deal with much more in future lectures are things like outliers, transformations and censored data. These are problems that make data a little bit difficult to look at. We will get back to these in subsequent lectures. So what are numerical quantities? These are useful numbers that describe the underlying data set. The percentile is one which we use extensively.

A percentile in the data is a data value that is greater than or equal to a given percentage of the data values. In statistical terms, the Pth percentile is the data value that is greater than or equal to P percent of the data and is less than or equal to 100 minus P percent of the data. So sometimes percentiles fall between values and you interpolate; at other times they fall right on a data value. What are important percentiles? Probably the most important one is the median or 50th percentile, but other ones used quite frequently are the 25th percentile and 75th percentile. Remember that the 25th to 75th percentile is the interquartile range and it's a useful measurement of the data set. Other important ones are the 90th percentile and 95th percentile. The 98th percentile is important because it's built into the PM 2.5 standard and the 99th percentile is important because it's built into the PM 10 standard. So these are statistical quantities that you'll be dealing with quite frequently. Now, we can look at these quantities as describing either measures of central tendency or measures of dispersion.
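A minimal sketch of a percentile calculation with interpolation. Several interpolation conventions are in use; this one places the Pth percentile at position (n minus 1) times P over 100 in the sorted data, which is an assumption made here rather than something stated in the lecture. The sample values are invented.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation between sorted values."""
    data = sorted(values)
    position = (len(data) - 1) * p / 100  # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + fraction * (data[upper] - data[lower])


# Hypothetical concentrations, in micrograms per cubic meter
sample = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 20.0, 40.0, 42.0]

median = percentile(sample, 50)  # falls between the 5th and 6th sorted values
interquartile_range = (percentile(sample, 25), percentile(sample, 75))

print(median, interquartile_range)
```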

Measures of central tendency are things like the mean, the median and the mode. The mean is the average, the median is the 50th percentile, and the mode is the value that occurs most frequently. Measures of dispersion could include the range, lowest to highest, the sample variance and sample standard deviation, the coefficient of variation, which is the ratio of the standard deviation to the sample mean, or the interquartile range. These measures tell you how the data is distributed around the measure of central tendency. You want to use both in looking at your data. Look at the Northbrook data set, look at how it's dispersed, how it's distributed and what it looks like. Again, this is a one-year data set from 2003. With measurements of PM 2.5 on a once every three-day basis, there were a total of 118 samples, a pretty complete data set in that we're missing very few observations. Now to begin with, I took this data and I sorted it from lowest to highest. If you pull this data set off the web, you'll see it includes dates when each measurement was collected.

But this is the first ten and last ten observations from this data set. You simply sort lowest to highest and rank it. Rank one is 2.6 micrograms per cubic meter. Rank 118 is 44 micrograms per cubic meter. If you just look at this data set and the spreadsheet shows these calculations, the mean is 12.1. The median is 10.6 and the mode is 12.0. And the Excel functions average, median, mode, just simply pull those numbers right out of the data. For the measures of dispersion, the range is 2.6 to 44. The sample standard deviation is 7.27. And the coefficient of variation, the ratio of standard deviation to mean, is 0.60. If you recall, on our PM 10 example from the last lecture, the sample coefficient of variation for PM 10 from three years of data was 0.52. So PM 2.5 superficially is distributed somewhat similarly to PM 10. The interquartile range is 6.7 to 14.4. Now how do you look at this data? The simplest way to look at this data is to create a histogram. A histogram is a simple picture of the distribution of sample frequency. So to do this, we rank the data.

I showed you that already. And then because we have a fairly large number of observations we want to collapse these into some bins or intervals. And I'm going to select two micrograms per cubic meter as the interval width here. I’ll break our zero to 44 up into 22 bins, each bin being two micrograms wide, and I'm going to count the number of observations that fall within each one of these intervals. Now to do that in Excel we have a function called "count if." Count if, used this way, is a cumulative count function which gives us the total number of cases that meet a certain criterion. So, for example, I show here the first statement “count if”, and I give a range to look over. “Count if” less than 2 will look down the whole column of numbers and count the number of times the condition less than 2 occurs. Or we can look at “count if” less than 4, “count if” less than 6, and you do this 22 times, or 24 times, and you get the “count if” through all the bins.

Now, this returns the cumulative count, the total number of observations less than each criterion. So in this spreadsheet you can see that there are zero counts less than 2 and four counts less than 4, a total of 20 counts less than 6, and a total of 40 counts less than 8. To come up with the individual counts within each one of these ranges, you simply take the difference between successive cumulative counts. So 4 minus 0 is 4, there are four counts between 2 and 4, and 20 minus 4 is 16. There are 16 counts between 4 and 6, et cetera. So the right-hand column of this little spreadsheet here shows you the total number of counts within each interval range. And then we could just simply go to the graphing function and select histogram and specify the counts for the Y variable and the upper limit of the concentration interval for the label on the X axis, and Excel returns this graph. This is the frequency distribution of the PM 2.5 measured at Northbrook.
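The same binning procedure can be sketched outside Excel: a cumulative count below each bin's upper edge, then differences between successive cumulative counts. The function names and the short list of concentrations are invented for illustration. Note that a strict "less than" comparison leaves out a value exactly equal to the top edge.

```python
def cumulative_count(values, upper):
    """Number of values strictly below `upper` (a cumulative 'count if')."""
    return sum(1 for v in values if v < upper)


def bin_counts(values, width, top):
    """Return bin upper edges, cumulative counts, and per-bin counts."""
    edges = [width * k for k in range(1, int(top / width) + 1)]
    cumulative = [cumulative_count(values, edge) for edge in edges]
    # Per-bin count is the difference between successive cumulative counts
    counts = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
    return edges, cumulative, counts


# Hypothetical concentrations, in micrograms per cubic meter
sample = [2.6, 3.1, 4.5, 5.0, 5.9, 6.2, 7.7, 9.4]

edges, cumulative, counts = bin_counts(sample, width=2, top=10)
for edge, total, count in zip(edges, cumulative, counts):
    print(f"< {edge:>2}: cumulative {total}, in bin {count}")
```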

What I've done on the next slide is I simply overlaid on the slide the measures of central tendency. Here you can see the mean of 12.1 and the median of 10.6. The median is just a little bit to the right of the mode, the mode being eight. The mean is a little bit to the right of the median. And means tend to be a little higher than medians because we usually have extreme values to the right. And interquartile range of 6.7 to 14.4, really nicely in this case includes all of the high bars on the graph. So this is a nice simple picture of the data.
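For readers working outside Excel, the measures of central tendency and dispersion discussed in this lecture can be sketched with Python's standard library. The values below are invented and skewed to the right so that the mean lands above the median, as described above.

```python
from statistics import mean, median, mode, stdev

# Hypothetical right-skewed concentrations, in micrograms per cubic meter
data = [4.0, 6.0, 8.0, 8.0, 10.0, 12.0, 22.0]

# Measures of central tendency
sample_mean = mean(data)
sample_median = median(data)
sample_mode = mode(data)  # most frequent value

# Measures of dispersion
data_range = (min(data), max(data))
sample_sd = stdev(data)                # sample standard deviation (n - 1 divisor)
coeff_variation = sample_sd / sample_mean

print(sample_mean, sample_median, sample_mode, data_range, round(coeff_variation, 2))
```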

Now, other important numerical quantities that you need to understand for quality assurance evaluation or techniques are measures of association. These are relationships or levels of association between two or more variables. And the two quantities that I'm going to show in examples right here are correlation coefficients that are either parametric or non-parametric.

The parametric coefficient is called a Pearson's coefficient. It measures the strength of the linear relationship between two variables and it varies between minus one and plus one, minus one being a perfect negative correlation, zero being no correlation whatsoever, plus one being a perfect positive correlation. But it's a parametric value and it can be strongly influenced by extreme values. You can have a data set where there's no correlation at all except one extreme value. And the presence of that one extreme value will artificially cause the Pearson's coefficient to be quite high, when the underlying relationship is really zero.

The Spearman rank order correlation is a non-parametric measure of association, and it's based on correlating the ranks of the variables rather than the variables in the original units. It is not influenced by extreme values. And it's quite useful to look at. So to demonstrate these correlation coefficients, I'm going to use the Springfield data set.

The Springfield data set is interesting because it was a monitoring site in Chicago where there were three independent measures of PM 2.5. POC1 is a measurement with the FRM that we used back in the T test example. POC5 was a 24-hour average gravimetric measurement of PM 2.5 from the speciation monitor, very similar to the gravimetric FRM measurement. The third measure is a continuous one-hour average beta gauge monitor. This is a monitor which looks at the attenuation of beta particles through the filter. If you recall from your health physics class (if you had one), attenuation is proportional to mass density. Since we know the density of particles, we can measure mass. It's a very clever way to measure particles. If we just look at the POC1 to POC5 relationship, we see a very strong linear relationship between these two monitors. And the slope is very close to 1, which means there's very little bias between the FRM and the speciation monitor. We can also look at the average, the 24-hour average of the continuous measure against the FRM. And this also is a very strong relationship with a slight positive bias, the slope here being about 1.16, 16% positive bias but a very strong relationship.

The Pearson correlation coefficient is the covariance of X and Y divided by the standard deviation of X times the standard deviation of Y. It's computed as shown here. It's the sum of the products of the differences between each X and the mean X and between each Y and the mean Y, divided by the square root of the sum of squared differences for X times the sum of squared differences for Y. In this case, using the Excel Pearson function, you point to the X variable which is in column B and the Y variable which is in column E. It returns the Pearson coefficient of 0.98, a very strong positive correlation.
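A minimal sketch of the Pearson calculation written out term by term, together with the earlier point that a single extreme value can inflate the coefficient. The data are invented so that the first five pairs have essentially zero correlation; they are not from the Springfield data set.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient computed from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired values with no linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 2.0, 1.0, 2.0]
r_without_outlier = pearson(x, y)

# The same data with one extreme pair added
r_with_outlier = pearson(x + [20.0], y + [20.0])

print(round(r_without_outlier, 3), round(r_with_outlier, 3))
```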

The Spearman correlation coefficient is computed by looking at the ranks. So you have to rank the two data sets. You have to rank POC1 and POC5 independently, and then what you're basically doing is correlating the ranks, not the original variable. So you rank the data sets and compute the difference between the ranks, that is, the rank in data set one minus the rank in data set two. Then you compute the statistic shown here, which is the sum of the squared rank differences times 6, divided by n times n squared minus 1. You take that quantity and subtract it from 1.

That is the Spearman correlation coefficient. In Excel it's not particularly difficult to do. The procedure is to sort the whole data set by the first variable and then rank, just have a column of numbers from 1 to n. Then re-sort the whole data set by the second variable and have another rank; we call that rank POC5. And again number 1 up to n. When I have these two ranks, I can simply calculate the difference, rank 1 minus rank 2, and then compute the Spearman coefficient. So here's the Excel spreadsheet. What I did after I did the ranks, in this case, is I re-sorted by the date. So these are the last six dates from this data set. And there were 42 days during the year where both POC1 and POC5 operated. And here you see the rank for POC1 and the rank for POC5. So on the last date, POC1 was the 30th observation, POC5 was the 31st observation.

On the second to last date POC1 was the 17th observation and POC5 the 13th observation. I calculated the difference between those two ranks, squared them, summed them and computed the Spearman correlation coefficient, in this case 0.977. Remember, large outliers will cause the Pearson and Spearman coefficients to differ. In this case there's very little difference because the data were pretty normally distributed, not particularly extremely distributed. But it's fairly easy to calculate. You can also look at the probability that the correlation coefficient is greater than 0. The null hypothesis here is that the correlation coefficient is equal to 0. You compute the Z statistic. It's the Spearman correlation coefficient times the square root of n minus 1. The critical Z you look up here is 1.96, which corresponds to a total error probability of 0.05 split between two tails (0.025 in each). Clearly a Z of 6.26 is much larger than 1.96. So you would conclude that this is a significant correlation. It's pretty obvious that the correlation was greater than 0. Now I want to look at distributions a little bit because we're making some assumptions about distributions here.
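The rank-based procedure can be sketched as follows. The ranking here assumes no tied values, as in the sort-and-number approach described in the lecture; the paired values are invented. The last lines reproduce the lecture's Z statistic from the quoted coefficient of 0.977 and n of 42.

```python
import math


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_z(rs, n):
    """Z statistic for testing whether the Spearman coefficient differs from 0."""
    return rs * math.sqrt(n - 1)


# Hypothetical paired measurements; the extreme last value only affects its rank
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 3.5, 2.5, 4.5, 100.0]
rs_example = spearman(x, y)

# Values quoted in the lecture for POC1 versus POC5
z_lecture = spearman_z(0.977, 42)

print(round(rs_example, 3), round(z_lecture, 2))
```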

And I want to show you some techniques for looking at the distributions graphically and to make some judgments about what we have here. Graphical methods include stem and leaf plots, histograms and probability plots. And we'll look at a number of different plots in the next few slides. I want to point out that there are statistical tests for looking at distributions. You can state the null hypothesis that a data set follows the distribution and look for evidence to reject it. But those tests tend to be very conservative and will frequently cause you to make a poor judgment. I prefer looking at the numbers by graphing them and making a judgment that way. And I'll show you those techniques right here. The simplest plot to do is to simply rank versus concentration. Here I took the Northbrook data. I plotted on the X axis the rank of each case against the concentration. You see that these measurements fall on a smooth line; that there seems to be a continuous smooth distribution, but it's not symmetrical. If this were symmetrical distribution, the two tails would look similar. Now, the distribution that we are probably looking to compare this against is the normal distribution.

So I've re-included here, the plot of the normal distribution. And here it's shown both as a cumulative distribution, the line rising to the right and the standard bell-shaped curve. And I want to point out a few interesting features in normal distribution that are important for evaluating how data actually look. One is that the mean is in the middle of the distribution. And if you go plus or minus one standard deviation, on either side of the mean, you have 68% of the data. If you go plus or minus 2 standard deviations on either side of the mean, you include 95.5% of the data. If you go out two standard deviations to the right, you'll see the tail contains about 2%. If you go two standard deviations to the left, you'll see the left tail is about 2%. So the bottom axis of this distribution can be defined either as a number of standard deviations above and below the mean, plus one, plus two, plus three, minus one, minus two or minus three. Or as a percentile. That's an important concept. So as you're working these problems out, it's very useful to come back and look at this picture, because it will help you understand what the X axis is on a probability plot. Here is the probability normal function.
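The fractions quoted for the normal distribution can be checked directly from the cumulative distribution function. This is only a quick check using Python's standard library; the variable names are chosen here.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_1sd = z.cdf(1) - z.cdf(-1)  # fraction within one standard deviation
within_2sd = z.cdf(2) - z.cdf(-2)  # fraction within two standard deviations
upper_tail_2sd = 1 - z.cdf(2)      # fraction beyond +2 standard deviations

# The same axis expressed as a percentile instead of standard deviations
percentile_at_plus_1sd = 100 * z.cdf(1)

print(round(within_1sd, 3), round(within_2sd, 3), round(upper_tail_2sd, 3))
```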

This is just a mathematical form of the cumulative distribution. The other function that we'll be using is the log normal distribution. Here are three different log normal distributions where the mean of the log of the value is 0, with three different standard deviations. In this case it's the geometric standard deviation of 0.1, 0.5 and 2. So you can see as you increase the standard deviation you spread the data out more. And just superficially, if you look at the Northbrook data, it looks more like the log normal distribution than it looked like the normal distribution. If you just look grossly at the shape of this curve compared to what you just saw, it does look approximately log normally distributed. And here is the mathematical form of the log normal distribution, where Y is the log of X. So you take the original values of X and log them all, and then Y in this equation is the log of X, and the mean of Y and the standard deviation of Y are computed just like the mean of X and the standard deviation of X were computed. It's just that now we're working on the log of the value rather than the original untransformed number.

The first thing to do is the probability plotting method. I'll show you how to create probability plots by hand using graph paper, and then we'll do it in Excel. We start by sorting the data from lowest to highest. And if you have a data set with a large number of observations, you want to put them into equal-sized bins like we did for the Northbrook data. If you have a small data set with only 15 or 20 observations, you don't have to put them in bins, just leave them in their original form. But for Northbrook we put them into these bins of two micrograms width. We then compute the incremental fraction in each interval. So in this case you take the number in each interval divided by the total number in the data set. Now I'm not dividing by N, but I'm dividing by N plus a half. And the reason for that is if I divide by N, the top interval, if I accumulate from the lowest to the highest, will have 100 percent. And 100 percent does not exist in this distribution. This distribution goes from 0 to 100 percent asymptotically, but never actually gets there. So by dividing by N plus 0.5 I'm actually able to put the last data point on my graph, in the real world, not out at infinity. If you try this and play with the data, you'll see what I'm talking about.

I compute the incremental fraction at each interval. And then I compute the cumulative percent up to that interval. What we'll ultimately plot is the cumulative percent of the data less than a given concentration on the X axis, and the concentration on the Y axis. We're going to put it on this kind of graph paper. This is an example of log probability graph paper, where the Y axis is a log scale and the X axis is a probability scale. When you get a chance, I want you to take a look at this piece of graph paper and then take a look at that standard normal curve that I had shown you before. I want you to see how the X axis is defined on this graph paper. The X axis is linear in standard deviations. If you step off plus 1, plus 2, plus 3 from the 50th percentile, those will be equal distances. If you step off minus 1, minus 2, minus 3, those would be equal distances too. This is an X axis that is linear in standard deviations but labeled as a probability in percent, and in this case the Y axis is log.

This paper is a little bit hard to find these days. So I've scanned these into PDF files and will put these on the website as well so you can download and print them. But, again, I want you to look at the X axis and look at the probability along the X axis compared to the standard normal curve to see how it's defined. So here's the Northbrook data. I know that there are four counts less than or equal to four micrograms per cubic meter. And so that represents 0.0338 or 3.38% of the data. I know that there are 20 counts less than or equal to 6 micrograms per cubic meter. Those 20 counts represent 16.9% of the data. I know that there are 40 counts less than or equal to 8 micrograms per cubic meter and those 40 counts represent 33.76%. So I'm plotting the cumulative fraction as a percent, which is the right-hand column, against the maximum concentration, which is the left-hand column.
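The binning and accumulation steps can be sketched in a few lines. This is a minimal sketch in Python rather than the spreadsheet itself. The bin counts below are the ones implied by the cumulative counts quoted above (0, 4, 20 and 40 at upper edges of 2, 4, 6 and 8 micrograms per cubic meter), and the total of 118 observations is an assumption inferred from 4 counts being 3.38% of N plus a half; it is not stated in the lecture. Only the first four bins are shown.

```python
from itertools import accumulate

def cumulative_percent(bin_counts, n_total):
    """Cumulative percent at each bin's upper edge, dividing by N + 0.5
    so that the top bin never reaches exactly 100 percent."""
    cum_counts = list(accumulate(bin_counts))
    return [100.0 * c / (n_total + 0.5) for c in cum_counts]

upper_edges = [2, 4, 6, 8]   # ug/m3, upper edge of each 2 ug/m3 bin
bin_counts = [0, 4, 16, 20]  # incremental counts implied by the quoted cumulative counts
N_ASSUMED = 118              # assumption: inferred from the quoted percentages

percents = cumulative_percent(bin_counts, N_ASSUMED)
for edge, pct in zip(upper_edges, percents):
    print(f"<= {edge} ug/m3: {pct:.2f}%")
```

Under that assumed total, the sketch gives values that round to the 3.38%, 16.9% and 33.76% read off the table.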

I'm going to hand plot it on probability graph paper. And here it is. There are 22 data points. And I just plotted them by hand. The left-hand scale goes from one to 100. The bottom scale is this percentage. And what I've done here is I've just taken a ruler and eyeballed the straight line through these data points. It looks like a very straight line, which tells me that this data is close to a log normal distribution, because I've done a normal transformation on the X axis, log transformation on Y axis. It's log normally distributed. From this graph you can read the geometric mean; geometric mean is the 50th percentile. You see on that graph I drew a vertical line at 50th percentile, which is right at 10.8 micrograms per cubic meter, and the geometric standard deviation is the ratio of the 84th percentile to the 50th percentile. So I've shown you the 84th percentile is 18.6 micrograms. So 18.6 divided by 10.8 gives you a geometric standard deviation of 1.72. Not too hard to do, and again download and print the graph paper.

You can use that no problem. However, we can also use Excel to automate this process. In Excel, you do the same four steps: sort the observations; bin them, if you have a large number of observations; compute the incremental fractions; and compute the cumulative fractions. Then the last step is to transform the cumulative fraction to a Z score. That cumulative fraction or percent is some number of standard deviations above or below zero. To do that we use Excel's standard normal inverse function, the NORMSINV function. And if you download the Northbrook data set you can see how I've done that. Here are some of the results where I've taken the cumulative number and I've computed a Z score. Now 0 is undefined. So there isn't a Z score; if you tell Excel to give you the Z score for a cumulative fraction of 0, it will say error. But it's actually minus infinity, which is not a location on your graph. However, look at the second interval, 2 to 4, and the cumulative number of observations less than or equal to 4. That represents 3.3% of the data. And that's minus 1.82 standard normal scores to the left of 0. The third bin, 4 to 6, has a cumulative count of 20, or 16.8%. 16.8% is a Z score of minus 0.9, or just about one standard deviation to the left of 0. And you can see here I've got some of the cases; in the spreadsheet, I've got all of the cases.
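The same transformation is available outside Excel. A minimal sketch, assuming Python's standard library, where `NormalDist().inv_cdf` plays the role of NORMSINV; the wrapper name `z_score` is illustrative, and returning `None` for fractions of 0 or 1 is a design choice made here, standing in for Excel's error.

```python
from statistics import NormalDist

def z_score(cum_fraction):
    """Standard normal score for a cumulative fraction (like Excel's NORMSINV).

    Fractions of 0 and 1 correspond to minus and plus infinity, so they have
    no location on the plot; None is returned for them."""
    if cum_fraction <= 0.0 or cum_fraction >= 1.0:
        return None
    return NormalDist().inv_cdf(cum_fraction)

# Cumulative fractions quoted earlier for the Northbrook bins.
for fraction in (0.0, 0.0338, 0.169, 0.3376):
    print(fraction, z_score(fraction))
```

For 0.0338 and 0.169 this gives roughly minus 1.83 and minus 0.96, in line with the rounded values read from the spreadsheet.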

Then I just simply told Excel, plot for my X axis the Z score, which is the normal transformation, and plot for my Y axis the concentration, and I get a probability plot. The first probability plot I've given you here is the normal probability plot, where I haven't log transformed the Y axis, just to show you that this data is not normally distributed. If this were a normal distribution, the points on a normal probability plot would fall on a straight line. But clearly these data points are curved. Not a straight line. Even though Excel can fit a straight line, it's a very poor fit. I then simply tell the program that it's probably a log normal distribution, as are most environmental data sets, so I just tell Excel to log transform the Y scale. When you do that, you get a plot that looks like this. You sometimes have to go in there and tell Excel to do a lot of things to make the graph look pretty. I simply told it to give me two orders of magnitude on the Y axis, one to 100, and to scale the X axis from minus 3 to plus 3 standard normal scores.

And you see that these data points look just like they looked on the hand-created plot. They fall in a nice straight line. Excel is able to fit this line for me and draw it. I don't have to draw it with a ruler. I can draw it with Excel's curve fit function. Just recognize that because this is a log transformed scale, I must use an exponential function to draw a straight line on this graph. So this distribution is fit with a geometric mean of 10.58, which is where the line crosses 0. And the geometric standard deviation, or the slope, is E to the 0.5542 power. If you simply put 1 in for X, one standard deviation for X, E to the 0.5542 times 1 gives you the slope. And I've shown it right here. Geometric mean of 10.58 micrograms per cubic meter. Geometric standard deviation of E to the 0.5542, or 1.74.
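The fitted line has the form concentration = GM × exp(b × Z), so the geometric mean is the value at Z = 0 and the geometric standard deviation is exp(b). A minimal sketch of such a fit, assuming an ordinary least-squares regression of the natural log of concentration on Z (an assumption about how the trend line is obtained) and written in Python rather than Excel. The demonstration points are synthetic: they are generated from the quoted line (10.58 and 0.5542) purely to show that the function recovers those parameters, and they are not the Northbrook data.

```python
import math

def fit_lognormal_line(z_scores, concentrations):
    """Least-squares fit of ln(concentration) = intercept + slope * z.

    Returns (geometric mean, geometric standard deviation), i.e.
    exp(intercept) and exp(slope)."""
    logs = [math.log(c) for c in concentrations]
    n = len(z_scores)
    z_bar = sum(z_scores) / n
    y_bar = sum(logs) / n
    sxx = sum((z - z_bar) ** 2 for z in z_scores)
    sxy = sum((z - z_bar) * (y - y_bar) for z, y in zip(z_scores, logs))
    slope = sxy / sxx
    intercept = y_bar - slope * z_bar
    return math.exp(intercept), math.exp(slope)

# Synthetic points lying exactly on the quoted line (not measured data).
z_demo = [-2.0, -1.0, 0.0, 1.0, 2.0]
c_demo = [10.58 * math.exp(0.5542 * z) for z in z_demo]

gm, gsd = fit_lognormal_line(z_demo, c_demo)
print(f"geometric mean = {gm:.2f}, geometric standard deviation = {gsd:.2f}")
```

With real binned data, the Z scores and bin upper edges from the earlier steps would be passed in place of the synthetic lists.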

This is in excellent agreement with the hand plot. The two graphs looked almost exactly the same, and the results are almost exactly the same. The geometric mean from Excel was 10.6, from the hand plot 10.8; geometric standard deviation from Excel was 1.74; from the hand plot, 1.72. So don't be shy about doing things by hand. That's great. If you feel comfortable with Excel, that works as well. But this is a picture of the distribution, pretty strong evidence to suggest that in this case the Northbrook data is log normally distributed. Now if you prefer the histogram as a way of looking at distributions, that's okay. We can actually use the model we just created in Excel and reproduce the histogram of the data, assuming it was normally distributed, or log normally distributed. So we take that exponential trend line from the log probability plot. We know the geometric mean. We know the geometric standard deviation.

We can compute the Z score location for each point on the line and compute the predicted fraction at that location. So given that fitted line, what's the expected value in the 0 to 2 bin? What's the expected value in the 2 to 4 bin? And what's the expected value in each of the bins? We can compare the expected value to the observed value to see how well our data follows a log normal distribution. In this very busy spreadsheet, that's just what I've done. I've taken the predicted equation from the spreadsheet; and if you download and look at the spreadsheet, you'll see how I've defined the predicted fraction simply using that equation. I have a predicted fraction, assuming this were log normally distributed; a predicted cumulative number, assuming it's log normally distributed, by multiplying by N plus a half; and then a predicted count, how many incrementally should be in each bin, by taking the difference between two cumulative predicted numbers. I can then graph the predicted counts along with the real counts against the concentrations and make a histogram.
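The predicted-count calculation can be sketched as follows, in Python rather than the spreadsheet. The geometric mean (10.58) and geometric standard deviation (1.74) are the fitted values quoted earlier; the total of 118 observations is an assumption inferred from the quoted percentages, and only the first four bins are shown, so this is an illustration of the method, not a reproduction of the spreadsheet.

```python
import math
from statistics import NormalDist

def predicted_counts(upper_edges, gm, gsd, n_total):
    """Expected count in each bin under a fitted log normal distribution.

    For each bin upper edge: convert to a Z score on the fitted line, convert
    the Z score to a cumulative fraction, scale by N + 0.5, then difference
    consecutive cumulative numbers to get incremental counts."""
    std_normal = NormalDist()
    cumulative = []
    for edge in upper_edges:
        z = (math.log(edge) - math.log(gm)) / math.log(gsd)
        cumulative.append(std_normal.cdf(z) * (n_total + 0.5))
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

upper_edges = [2, 4, 6, 8]  # ug/m3, first four 2 ug/m3 bins only
N_ASSUMED = 118             # assumption: not stated in the lecture

counts = predicted_counts(upper_edges, gm=10.58, gsd=1.74, n_total=N_ASSUMED)
for edge, count in zip(upper_edges, counts):
    print(f"bin ending at {edge} ug/m3: predicted count {count:.1f}")
```

Plotting these predicted counts beside the observed counts for every bin gives the side-by-side histogram comparison.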

So just before I show you the histogram, I want to show you the log probability plot of the data, which shows how it's log normally distributed. And here are the predicted counts. In this graph, the predicted counts are right next to the observed counts. And you'll see the predicted counts are beautifully distributed, smoothly distributed. You see a little bit of overprediction at low values and underprediction at high values; but, nonetheless, the data follows this log normal distribution quite nicely. I think this is probably the better way to actually show how closely your data fits a model distribution. It's a little bit more cumbersome to compute, but using my example, it ultimately won't be that hard to reproduce on your own examples. So now that we've looked at distributions, I want to get back and talk for the remainder of this lecture about quality assurance and what the project plan is. And I have a couple of really useful examples that I think you'll be able to apply and use as examples for designing your own quality assurance plans.

Now, what does the quality assurance project plan require? It requires a number of things. One is that you have technical and quality objectives that are identified and agreed upon. So before you begin a process, you want to agree on what the precision and accuracy of the underlying measurements should be. You specify what your goals are in your quality assurance project plan. And then as you design your study, you design the measurements and you build in the samples that you need in order to assess these goals. So you need to have measurements. You need to have data. You need to acquire things that are appropriate for achieving these project objectives. You then need procedures to assess whether you have sufficient data and whether the data are of sufficient quality to meet these stated objectives, and then to look at any limitations in the data.

So the first example I'm going to look at is a project being run here out of Region Five by the five Region Five states. I'm sorry, the six Region Five states. We have a brand new, just recently started, air toxics network. We've been collecting data now for about two years for toxic metals, volatile organics and aldehydes. Each of the states runs one or two sites and analyzes its data within its own state laboratory. And so the first question that the states were asking each other was: are our laboratories equivalent, or do we have systematic bias issues? Does one state always measure something lower than the other state, or are there precision problems; is there a lot of noise in this data set? So the states agreed to do a comparability program, where we define comparability as the ability to meaningfully compare data collected at the same time and the same location by the different laboratories.

So the state of Wisconsin volunteered to set up a test bed where they had a long manifold. They put multiple samplers on this manifold and simultaneously collected a sample for each of the laboratories. They did this for metals, aldehydes and VOCs. They took those samples and sent them off to different laboratories and they were analyzed. And they were in fact analyzed by anywhere from two up to seven laboratories at a time. I'll show you the difference as we probe into the data. So each state uses standard methods to analyze its samples. And within the laboratory they all have their calibration and control procedures that are appropriate. But the question we're trying to ask is a higher level question: Are the ultimate measurements that each of these states is producing equivalent, or are they not equivalent? So we have this external audit sample that we're applying to these states. We're going to do this by using a direct exchange of ambient air samples among the different states. And we're going to look at two levels of agreement.

We're going to look at qualitative and quantitative agreement. Qualitatively we want to know if we are detecting the same pattern of observations. Are all the labs finding benzene or not? Are all the labs finding toluene or not? That is, qualitatively, is the pattern of detects versus non-detects the same? Quantitatively we want to ask, are we getting the same answer? If lab A says it's six, does lab B also say it's six? So we want to look at the data both qualitatively and quantitatively. Here's an example of VOCs from one comparison. This was a set of canisters collected on December 13th, 2003 that were sent to Indiana, Ohio, Wisconsin, Michigan and Minnesota. Indiana analyzed this canister twice. And you can see that there's reasonable qualitative agreement. For example, if you look at benzene, you'll see that all the states found benzene in all the samples. In contrast, if you look just below that to bromoform, none of the states found bromoform. So in general, the pattern of detected versus non-detected looks similar. Qualitatively, these states are in agreement.

Quantitatively, we can look at manganese, for example. This was a particular sample filter that was broken up into sections, sent to different states. Indiana found 48 nanograms on this filter; Wisconsin 44. And so the relative percent difference in this case is computed, as we showed you in the past lecture, 48 minus 44 divided by the mean, or 8.4%. A very high precision for manganese between these two states. And, in fact, if you look at the same thing between Indiana and Wisconsin, even better precision, 6%. You can even compute this relative standard deviation between each individual state and the mean.
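The relative percent difference described here is the difference between two results divided by their mean. A minimal sketch in Python; the function name is illustrative. Using the rounded values of 48 and 44 nanograms, the formula gives a little under 9%, close to the 8.4% quoted, which was presumably computed from unrounded results (an assumption, since the unrounded values are not given).

```python
def relative_percent_difference(a, b):
    """Absolute difference between two results as a percent of their mean."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)

# Rounded manganese results quoted for the two laboratories, in nanograms.
rpd = relative_percent_difference(48.0, 44.0)
print(f"relative percent difference = {rpd:.1f}%")
```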

So no matter how you look at manganese, you see very high precision. But that's maybe not the whole picture. And so what we've done is we've created a graph here which shows the results for a number of metals simultaneously across these state laboratories. The tallest bar on this graph is the manganese concentration. It shows excellent agreement between the six states; they all measured roughly the same manganese. However, the picture is not so clear when you look at the other metals. So if you look at nickel, cadmium, lead and arsenic, you'll see substantial differences, not only in level, but even in reporting. Some states found these metals; other states didn't find these metals. So if you looked at the measures of relative percent difference for these other metals, they would be much greater than was the case for manganese. So graphing data is a very nice way to look at comparability.

Look at VOCs. VOCs were done with canisters; up to six laboratories and the federal contract laboratory, ERG, each got canisters. They did one canister exchange in 2002, two exchanges in 2003 and one more exchange in 2004. And what I've done for each compound that came out of each one of these exchanges is calculate a mean, a standard deviation and a coefficient of variation, and plot these coefficients of variation versus the sample mean. You can see the precision of this method. Now, each of the state labs used a similar method with similar equipment. It was typically TO-14 or TO-15. PAMS is a variation on that. But the equipment was slightly different. Either HP or PerkinElmer gas chromatographs were used. Some labs used a mass selective detector, and other labs used FID. They all used the DB-1 column. Some dried with a Nafion dryer, and others did not.

There were some differences in the methods. So part of the reason why each lab reports a different answer may be these kinds of differences. This is called the combined VOC data set which you should download and take a look at. But what I thought was fascinating, if I simply plotted the mean value versus the coefficient of variation, you get this interesting pattern. For those compounds where the average concentration was greater than 1 ppbv, (this was typically toluene), the relative percent standard deviation was almost always less than about 20%. So a precision of about 20% defined by the relative standard deviation is achievable for concentrations above 1. For concentrations above about a quarter part per billion, 40% is an achievable number. There's only one relative percent deviation greater than 40% when the concentration was greater than 0.25 parts per billion.

I've been asked many times, how good is good? Here's a way to actually specify good. I think you could achieve 20% relative standard deviation for concentrations above 1 and perhaps you should specify 40% as your goal for concentrations above a quarter. But when the concentration gets lower than a quarter, what happens? The coefficient of variation explodes. And it goes well over 100 percent. What's happening here, if you recall from our last lecture when we talked about signal-to-noise, is that as our signal gets lower and lower, the noise, which is relatively constant, becomes a large percentage of the signal. In fact, it exceeds the signal. And so as we go to these low concentrations, we're looking at noise, not signal. So the relative standard deviation or the precision explodes, because we're just looking at random noise, not actual signal. So this kind of result, I think, is what's to be expected.
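The way a roughly constant noise level turns into an exploding relative standard deviation can be illustrated with a short sketch in Python. All numbers in it are hypothetical: the three laboratory results and the noise level of 0.1 ppbv are chosen only for illustration and are not taken from the combined VOC data set.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)

def relative_noise_percent(signal, noise_sd):
    """Constant absolute noise expressed as a percent of the signal."""
    return 100.0 * noise_sd / signal

# Hypothetical results for one compound from three laboratories, in ppbv.
hypothetical_labs = [1.0, 1.2, 0.8]
print(f"CV across labs: {coefficient_of_variation(hypothetical_labs):.0f}%")

# Hypothetical constant noise level; the same absolute noise is a growing
# share of the signal as the concentration falls.
ASSUMED_NOISE_SD = 0.1  # ppbv, illustrative only
for signal in (2.0, 1.0, 0.25, 0.05):
    share = relative_noise_percent(signal, ASSUMED_NOISE_SD)
    print(f"signal {signal} ppbv: noise is {share:.0f}% of the signal")
```

Once the concentration drops below the noise level itself, the relative figure passes 100 percent, which is the behavior seen at the low end of the graph.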

And it confirms to me that the Region Five laboratories are doing an excellent job at measuring VOCs, and it gives us benchmarks we can use for specifying data quality objectives in future studies. This is a fantastic, really useful exercise and the states are still doing this. As additional samples are collected I will continue to update these spreadsheets, and so in the future I'll try to put even more data points on this kind of graph. Now, aldehydes were done a little differently. For 2003, the aldehydes were collected in pairs. So we had 17 pairs of samples in 2003. Each time a pair of samples was collected, they were sent to two of the six laboratories. In 2004, they changed to a manifold where they collected seven samples simultaneously and sent those seven samples to each of the laboratories.

So I'll look at the two sets differently because they're different in the way they were collected. So for the 2003 data I'm just looking at the paired data; and, arbitrarily, one of the two samples was identified as primary and the other one as secondary. So if you look at something like formaldehyde, excellent agreement. All the labs always found formaldehyde well above the detection limit and the reproducibility was great. In contrast, butyraldehyde was not consistently found. If you look at the scatter plot, there's essentially no relationship. So for formaldehyde and acetaldehyde, look in the spreadsheet and you can see these, excellent agreement. For all other aldehydes, the picture wasn't so clear. But in 2004 they changed the protocol so that they sent the same sample to many laboratories. So here I'm able to calculate a mean, a standard deviation and a relative standard deviation, and I'm able to plot those relative standard deviations on the right against the means. And for 2004 the graph looks just like the graph for VOCs.

For concentrations above a half of a microgram per cubic meter, 20% relative standard deviation is achievable. Those three compounds I think were formaldehyde, acetaldehyde, and I'm not sure about the third one; I don't recall what it was. But for those aldehydes that were at much lower concentrations, the noise became a much more important factor and relative errors increased. So here's a way to specify a measurement goal of 20% relative standard deviation for concentrations greater than about a half.

This is the same idea if you remember we talked last time about looking at the precision of the PM 2.5 sample analysis program. When you calculate precision, you don't include any samples less than 6. That's because samples less than 6 are too close to the -- are down near the left-hand side of the distribution where there's a lot more noise affecting the sample. So by truncating at six, you're cutting away the noisy part of the distribution. Here you can see where the noisy part of the distribution begins. It begins at about half a microgram.

My final example is taken right out of the EPA website on ambient air quality. If you go to the AMTIC website, you can look at this document on PM 2.5 quality assurance. This is the three-year quality assurance report for 1999 through 2001. For the SLAMS, the basic monitoring network for PM 2.5. This is an interesting case because 1999 was a start-up year. So you're looking at this network when it first starts up through its first years of operation. So it forms a perfect test example. So what were the PM 2.5 data quality objectives? Before this network was operated, before it was placed out in the field the EPA defined how it was going to make its judgment as to whether or not they were successful. They decided -- they made a hypothesis that the annual mean would be the controlling standard, not the 98th percentile. That turned out to be very true, and that the three-year annual average was the true value. Just like we did for our PM 10 example in the last lecture.

So if you collect a sample every day of the year for three consecutive years with no missing values, the average of all those samples is the true value. They stated as a goal that bias and precision would be no more than 10%. That there would be no spatial uncertainty; each monitor stands on its own. And they based this whole network on one in six day sampling and they had a goal of 75% data completion. Now it turns out they exceeded all of these. I'll show you by example. They also hypothesized that this data set would be log normally distributed, with a population coefficient of variation of 80%. It turned out to be a little less than that. They hypothesized that the measurement errors would be normally distributed, which is probably a good assumption; that the seasonal ratio would be 5.3, which turned out to be a little too high; and that there would be no autocorrelation in the data.

There is some autocorrelation in this data set because they didn't use one in six day sampling. If they had stuck with one in six day sampling, I'm sure there would be no autocorrelation. Because they sampled more frequently, some autocorrelation did creep into the data set. I'll have a whole lecture on time series in the future for you to dig much deeper into this concept of autocorrelation. And they were going to place the decision errors at 5%. That is, they wanted 95% confidence in their decision.
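Autocorrelation between successive samples can be estimated with a lag-1 correlation coefficient. A minimal sketch in Python, assuming the common sample estimator that divides the lag-1 cross-products by the total sum of squares about the mean; the two short series are made up to exercise the function and are not PM 2.5 measurements.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: sum of products of successive deviations
    from the mean, divided by the total sum of squared deviations."""
    n = len(series)
    avg = sum(series) / n
    deviations = [x - avg for x in series]
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

# Made-up series: a steady trend (positive autocorrelation) and a strictly
# alternating series (negative autocorrelation).
trend = [1.0, 2.0, 3.0, 4.0, 5.0]
alternating = [1.0, -1.0, 1.0, -1.0]
print(lag1_autocorrelation(trend), lag1_autocorrelation(alternating))
```

Values near zero indicate that each sample carries essentially independent information, which is the situation the one in six day schedule was expected to produce.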

So what do these data quality objectives lead to? These data quality objectives lead to these curves. The left-hand curve is a curve assuming your measurement has a 10% positive bias. That is if you look at the 50th percentile, 15 would actually be measured as 13.5. So if you look right across the 50th percentile from the standard of 15 to the positive bias, you're down to a 13.5. And a negative bias is the right-hand curve, 15 up to 16.5. Now the fact that these curves are sloped is because they're sampling or they were hypothesized to be sampling on a one in six day schedule with a 10% precision. If you had a 0% precision, sampling every day, they would be vertical lines. And if you had zero bias, they would be vertical lines right on the standard. But they drift away from the standard because of the bias and they slope because of precision and sampling statistics. Just like we showed with PM 10, there's this gray zone, this area in between, where you can't use the data to conclusively state that the nation is above or below the standard.

This was their data quality objective; they wanted the gray zone to be no larger than this. Now, the effect of sampling frequency is dramatic. This graph shows the power curves just as you change sampling frequency from one in six days to one in three days to every other day. And as you increase the sampling frequency, you'll see that you decrease the gray zone area by making the curves much more vertical. And this is exactly the same thing we showed in the PM 10 sample where I sampled it with different frequencies compared it to the true value, the once-every-day sample.
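The reason more frequent sampling steepens the curves can be illustrated with the standard error of a mean, which shrinks with the square root of the number of samples. The sketch below, in Python, is a deliberately simplified illustration and not the EPA's power-curve calculation: it assumes independent samples, combines the hypothesized population coefficient of variation (0.8) and measurement precision (0.1) in quadrature, and ignores bias, completeness and autocorrelation.

```python
import math

def relative_standard_error(every_k_days, population_cv=0.8,
                            measurement_cv=0.1, years=3):
    """Approximate relative standard error of a multi-year mean when one
    sample is taken every k days (simplified: independent samples assumed)."""
    n_samples = years * 365 / every_k_days
    total_cv = math.sqrt(population_cv ** 2 + measurement_cv ** 2)
    return total_cv / math.sqrt(n_samples)

for k in (6, 3, 2):
    rse = relative_standard_error(k)
    print(f"one in {k} day sampling: relative standard error about {rse:.1%}")
```

Under these assumptions, going from one in six to one in three day sampling cuts the uncertainty in the mean by a factor of the square root of two, which is the direction of the effect shown in the power curves.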

How well did we do? We did quite well as a nation. Almost all regions had data in all quarters. So here you see that data was available from all regions and all quarters. And that all regions exceeded the 75% completion criteria. So we all met that initial goal of 75% data availability.

How did we do for precision and accuracy? The acceptance criterion for precision was 10% based on co-located samples; nationally, over the three years, we had about 7.2%. So we beat that by about 30%. And for bias, we had, as our national goal, no more than 10% bias, and our grand average bias over the three years was only 2.1%. So we did an excellent job in terms of precision and accuracy with this national program. There were slight differences if you sort this data by sampler. So if we look at the precision sorted by sampler, you'll see that the Andersen single stage sampler had a little bit of a problem compared to the other ones. Most of the samplers were sequential samplers. So the Andersen sequential and the R&P sequential were the bulk of the instruments, but the single stage Andersen appeared to have a slight precision problem. If you look at the bias over time, you can see the start-up pains in this network.

Back in the first quarter of 1999 this was a new idea. And when we first started operating, we had some serious problems with the Andersen sampler. I've talked to the folks here in Cook County, and the problem was that the blanks were becoming contaminated. In other words, just having the filter sit in the sampler over time caused it to gain mass from passive loading. By changing the sample cleaning protocol and changing the filter in the ventilation system that kept the inside temperature the same as the outside temperature, they were able to completely eliminate this bias. So you see by the third quarter of the first year the bias was essentially zero for the Andersen sampler. Sort of an expected start-up issue, but ultimately we did fantastic. So how did we do? The hypothesized seasonal ratio of 5.3 was wildly optimistic. In fact, the typical seasonal ratio from the highest to the lowest season was about 2.2.

The population coefficient of variation turned out to be 0.58, as an average. My PM 10 example from one monitor in Chicago was 0.52. So that agrees; PM 2.5 and PM 10 have similar variability, and 0.8 was a little bit high as an estimate. Because many samplers were operated on a one in three day basis, there was a little bit of autocorrelation; the correlation coefficient between successive samples averaged about 0.1. But sampling frequency was typically one in three days. Completeness was 83%. Bias only 4%, and coefficient of variation or noise 7%, which meant that the gray zone, hypothesized as a goal to run from 12 up to 19, in reality was about 14 to 16. So we exceeded substantially the data quality objectives; and, in fact, this graph shows that the decision error gray zone in the middle that we actually had was far better than the stated goals for the program. And primarily this was due to better precision, much better than 10%, one in three day sampling, and substantially better bias. Very, very little bias. That is, we measure on average much closer to the true value than the stated goal. As this program matures, if we looked at the next three years, I would expect this to be even a little bit better. Where to go for more?

Here's a series of Internet references where a number of these examples were taken from. We'll put this on-line for you to download and work with. Just as a quick wrap-up for this lecture, we've looked at a number of basic tools for looking at data. Statistical quantities. Relationships, means, standard deviations, correlation coefficients and how we use those to actually look at numbers in the context of data quality objectives. How good do you want your data to be? And I showed a number of important examples both done here within the region and nationally on toxics and PM 2.5. I hope this lecture ultimately is useful to you. The key for us in preparing this work was preparing examples from the real world that illustrate our points. And so this lecture will only work for you if you download those and take a look deep into the spreadsheets. Use what we've done. Send us your ideas and suggestions. I'd love to have more data sets, more of the regions to look at, and share with you through this series of on-line courses. We have plans for six more lectures and hope to see you early next year with additional work. Thank you.