GEM-SET : Girls' E-Mentoring Program : Science | Engineering | Technology

Daily Digest Archive for October 5, 2004

RESPONSE (Initially posted October 5, 2004) FROM STUDENT MEMBER JOSIE K. IN GA
This is in response to the questions that Ms. Harbert asked me in her
response (Thank you again by the way).
1. the relative humidity is discrete (with many pieces of data)
2. the elevation is discrete (also with many pieces of data)
3. I do not have any special statistical instruments, but I do have
Microsoft Excel.

October 5, 2004
A: FROM MENTOR DENISE HARBERT IN IL
Hi Again Josie! This is a long answer, but it's almost a semester's worth of
statistics information in a few pages. I'll be sending Excel directions for
you as well, but this answer explains the statistics ideas in English to help
you understand the formulas that Excel will calculate for you.

You're definitely on the right track with your "discrete" answers. Figuring
out what kind of measurements you have and what statistic you need to compute
is definitely one of the hardest parts in data analysis. Doing the math is
easy in comparison to figuring out which math to do. Here's proof - you even
had me stumped for a while! And I have 2 college degrees in statistics and
math! I was expecting you to answer "continuous" for the variable types.
When you wrote "discrete" for relative humidity, I stopped and thought that
you might be right and I might be wrong. Relative humidity and elevation both
can be meaningfully ordered from low to high values, which means they are not
binary and they are not categorical. The question then becomes whether or not
they have gaps in between. If they have gaps, they are discrete. If they do
not have gaps, they are continuous.

I know that relative humidity is a percent, which means it is the ratio of two
numbers (some number divided by some other number). If those two numbers are
integers (whole numbers without decimal points), then relative humidity would
be what is called a "rational number". Rational numbers are "infinite", but
they are "countable" because they go on forever but they can be mapped
one-to-one to the set of all integers. Integers have gaps in between them and
are discrete. If rational numbers can be mapped one-to-one to integers, then
rational numbers also have gaps in between them and are therefore discrete.
(e.g., The square root of 3 is not a rational number, but it's between 17/10
and 7/4, which are rational numbers. Therefore, the square root of 3
represents a gap between two rational numbers.) I then did an internet search
to learn more about relative humidity and to try to figure out whether or not
it is a rational number.

My internet search turned up 3 web sites I want to give you. One is "Hyper
Physics" for high school and college students, sponsored by the Department of
Physics and Astronomy at Georgia State University
(http://hyperphysics.phy-astr.gsu.edu/hbase/hph.html#hph). You might consider
going there since you live in GA! I also found a great summary page about 3
properties of the atmosphere (temperature, pressure, and moisture) at
http://www.acleanerenvironment.com/variables.html. That site was created
by an Earth Science teacher named Mr. Tuomey, whose main web page is
http://www.acleanerenvironment.com/. He has some interesting things on his
web site, but they are specifically designed for students he teaches in
person, so they don't all make sense without extra information from his
classroom. I don't know this teacher, so I would not recommend that you email
him directly for more information. (Best to be safe when surfing the
internet!) Finally, the National Oceanic and Atmospheric Administration main
page is www.noaa.gov. You could spend weeks navigating through all of the
links there, but their education link is http://www.education.noaa.gov/. If,
after reading through the education web site, you still have atmospheric
science questions, you could try emailing an NOAA Outreach Unit Coordinator at
noaa-outreach@noaa.gov. The Coordinator might be able to forward your
questions directly to an NOAA atmospheric scientist!

In all of this web research, I learned that relative humidity is influenced by
temperature and atmospheric pressure. Elevation is related to atmospheric
pressure, so you've indirectly covered that part in your project. However,
you might want to consider collecting three variables instead of two. It
shouldn't be too much extra trouble to measure temperature in addition to
relative humidity and elevation, so you could do all three at once. That way,
if you can't find a good relationship between elevation and relative humidity,
you should almost certainly find a good one between elevation, relative
humidity, and temperature without having to completely redo your project and
collect more data.

On the original subject of discrete or continuous, the formula at the top of
the Hyper Physics page at
http://hyperphysics.phy-astr.gsu.edu/hbase/kinetic/relhum.html says that
relative humidity equals actual vapor density divided by saturation vapor
density. Both the numerator and denominator densities are typically measured
in grams per cubic meter (g/m^3). Grams and meters - weight (mass) and
distance - are both real numbers (they can be measured with rational and
irrational numbers), so they are continuous. Therefore, relative humidity is
a real number and is continuous because it is the ratio of two real numbers
that are continuous. Elevation is really just a distance measured vertically
instead of horizontally and distances are continuous, so elevation is
continuous. This may seem like a surprise because weather people on TV
usually round these measurements to the nearest whole number. "Relative
humidity is 53%." But rounding a variable that could theoretically be
measured to any number of decimal places (continuous) is a different idea than
having a variable that cannot take values in between its possible points
because there are gaps between them (discrete).

That's probably more than you ever wanted to know about numbers! Maybe you
can use some of this as background information or an appendix in the research
report that you write to summarize your results. These two questions I gave
you were not easy, so don't feel bad that you didn't get them on the first
try. I didn't get them first either - I assumed continuous without thinking
through all the details first!

The short answer is that both relative humidity and elevation are continuous.
Now, what statistics should you use to explain the relationship between two
continuous variables? Chi-squared is a "test statistic" that is designed
almost exclusively for categorical data. A few chi-squared tests can be used
on discrete data, but usually those are used for discrete data without very
many possible answers. Discrete variables with a lot of possible answers
(like money and sometimes age) are often analyzed like continuous variables.
I have never seen a chi-squared test performed on continuous variables, so I
recommend that you do NOT try to use chi-squared for your project. Someone
probably told you about chi-squared because it is currently being taught in
most high schools as a student's first exposure to "test statistics".
Teachers may think that students will understand categorical variables better
than continuous variables. I learned it the other way and thought the other
way made more sense - continuous first, then categorical last. But I was good
in math, so thinking about averaging numbers and looking at them on an XY plot
(taught in algebra 1) made instinctive sense to me. It was very hard for me
to think about numbers that represent categories, but cannot be put in order!

Learning about statistics starts with samples and populations. Populations
are usually so large that you cannot measure them all. Samples are what
researchers collect when they measure some of the population. The "sample
size" is the number of observations you make when collecting your data.
Sample size is usually labeled with an "n". Sample size is NOT the number of
numbers you record. It is the number of measurement sets. In your project,
you will need to measure relative humidity at the exact same time and place
you measure elevation. These two numbers together represent one observation
because you observed them at one place and time. If you climb to a different
elevation and again measure relative humidity and elevation, then those second
two numbers would be a second observation. In psychology research, one
observation might represent one person who answered several questions
(variables). In chemistry, one observation might be one beaker of fluid that
you took several measurements on (variables). Typically, you want to have at
least 30 observations (n >= 30) if possible. You also want your observations
to be spread out as far as possible, which means measure high and low relative
humidity and high and low elevation.
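
As a rough illustration of what counts as one observation, here is a minimal
Python sketch. The readings are invented for the example, and the variable
names are only assumptions about how you might organize your own data.

```python
# Hypothetical readings, made up only to show the layout. Each tuple is ONE
# observation: (elevation in meters, relative humidity in percent), measured
# at the same place and time.
observations = [
    (120.0, 71.5),
    (340.0, 66.0),
    (560.0, 58.2),
    (910.0, 49.9),
]

# Sample size n is the number of observations, not the number of numbers.
n = len(observations)
numbers_recorded = sum(len(obs) for obs in observations)

# Split the pairs into the two variables, keeping them in the same order.
elevation = [obs[0] for obs in observations]
humidity = [obs[1] for obs in observations]
```

Here n is 4 even though 8 numbers were recorded. A real project would aim for
30 or more pairs spread across low and high elevations.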

Here is where I started writing step-by-step directions to explain what you
need to do in Excel and I ended up with 6 solid pages in 10 point font before
I stopped halfway through my directions! Obviously, that was way too much
information to put on the Daily Digest without boring the daylights out of
readers other than you! Instead, I'm going to mail a stack of Excel
directions to Sarah Shirk at GEM-SET so she can put your full name and address
on it and mail it to you. (I used to work for GEM-SET, so I know that your
personal information is kept in locked files and away from the internet for
your safety!) Please email Sarah at GEM-SET@uic.edu and give her your last
name and post office mailing address (including zip code) to be sure that she
has it. She will NOT post it on the Digest or share it with anyone, so your
identity will be protected! For the rest of this email, I'm just going to
describe what statistics you should calculate without giving too much detail
about how to calculate them in Excel. You and the other readers should know,
however, that the reason why it took me so long to type Excel directions is
because over the years I have found some programming errors in some of Excel's
statistics functions. They don't always give you the numbers that you're
supposed to get. (Excel was not designed by statisticians for statistical
analysis. Excel is excellent at the spreadsheet tasks it was originally
designed to do, but the statistics in Excel are mostly an after-thought
programmed by computer scientists who did not specialize in statistics.)

With continuous variables, you should always start with computing the
"measures of central tendency", which are statistics that describe the center
of your variables. The most common are the mean, median, and mode.

The mean is what you think of as the average. Statisticians usually say "x
bar" and write it as an "x" with a flat, horizontal line over the top. (I
can't type one here because the Daily Digest won't take special characters.)
You can compute means in Excel with "=average(<cell references of your
variable>)". To compute the mean manually, you would add up all the numbers
and divide by the sample size. When you have at least 30 observations in your
sample (n >= 30), then your mean will have probabilities (how likely will it
happen) like the normal distribution (a specific type of bell-shaped curve).
This is called the Central Limit Theorem. Knowing this allows you to compute
a normal distribution "test statistic" called the Z statistic and then a
p-value. P-values tell you how likely it would be to get a sample mean at
least as far from a proposed population mean as the one you measured if, for
example, the population mean of relative humidity really were that proposed
value. (See my answer to the question "What is calculus? What is statistics?"
in the March 12, 2002 Daily Digest at
http://www.uic.edu/orgs/gem-set/march.htm for a more thorough explanation.)
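
For readers who want to see the arithmetic outside of Excel, here is a minimal
Python sketch of the mean, the Z statistic, and a two-sided p-value. The
function name and the "hypothesized mean" argument are assumptions made for
this example. It uses the sample standard deviation (the (n-1) version) and
relies on the large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test_for_mean(values, hypothesized_mean):
    """Return (sample mean, Z statistic, two-sided p-value).

    Assumes the sample is large enough (roughly n >= 30) for the normal
    approximation to the sample mean to be reasonable.
    """
    n = len(values)
    xbar = mean(values)                        # add the numbers, divide by n
    standard_error = stdev(values) / sqrt(n)   # spread of the sample mean
    z = (xbar - hypothesized_mean) / standard_error
    # Chance of a Z at least this far from 0 in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return xbar, z, p_value
```

A small p-value says the sample mean would be unusual if the hypothesized
population mean were true.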

The median is the middle order statistic. Statisticians say "x tilde"
(pronounced till-duh) and write it with an "x" that has a squiggle over the
top like a narrow, sideways, backwards "s". If you were to put your elevation
numbers in order from smallest to largest, the median would be the number in
the middle. If you have two numbers in the middle (if your sample size is an
even number), then the median is the average of the two numbers in the middle.
In Excel, the function is "=median(<your variable>)". The median does not
have as many "nice" mathematical properties as the mean does, but it can be a
much more reliable measure of center because it is not sensitive to outliers
(numbers that are a lot different than the others).

The mode is the most frequent number in your data. Statisticians rarely label
it because it has virtually no mathematical properties, but some books call it
"x prime", like an "x" with an apostrophe after it. With continuous data,
there may not be a mode. In Excel, the function is "=mode(<your variable>)".
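
Here is a short Python sketch of the median and mode next to the mean, using
made-up elevation numbers with one deliberate outlier. The data and variable
names are assumptions for illustration only.

```python
from statistics import mean, median, multimode

# Invented elevations; 5000 is a deliberate outlier.
elevations = [100, 150, 150, 200, 250, 5000]

center_mean = mean(elevations)      # pulled far upward by the outlier
center_median = median(elevations)  # even n: average of the two middle values
modes = multimode(elevations)       # list of the most frequent value(s)

# With continuous data every value may be different, so no value stands out
# as "the" mode; multimode then simply returns every value.
no_clear_mode = multimode([1.1, 2.2, 3.3])
```

The median stays near the bulk of the data while the mean is dragged toward
the outlier, which is the sensitivity described above.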
After central tendencies come measures of variation. The deviations of
elevation would be computed by taking each elevation and subtracting the
elevation mean from it. These are how far each number deviates (differs) from
the mean. Square each of the deviations (3 becomes 9 and -2.1 becomes 4.41)
and you'll have the squared deviations. Add the squared deviations and you'll
have the sum of squared deviations. Divide that sum by (n-1), which is your
sample size minus 1, and you'll have the "variance". The variance is an
important statistic to report in your analysis. Statisticians usually say "s
squared" and label it with "s" and a superscript "2". In Excel, the function
is "=var(<your variable>)". Take the square root of variance and that is
called the "standard deviation", which is also an important statistic to
report. Statisticians usually label it "s". In Excel, the function is
"=stdev(<your variable>)" OR "=sqrt(var(<your variable>))".
The standard deviation is a measure of spread in your data. It measures how
far a typical value deviates (differs) from the mean. If your data are
normally distributed (a specific type of bell-shaped curve), then about 68% of
your values will be within one standard deviation of your mean. In other
words, about 68% of your values will be between (xbar - s) and (xbar + s). In
normally distributed data, about 95% of your data will be between (xbar -
1.96s) and (xbar + 1.96s). This is where a "95% confidence interval" comes
from, which is a range of values that are reasonable estimates of what
elevation or relative humidity would be if you randomly sampled one
observation from your population. This interval is not used in practice very
often because it requires normal data. A more common interval is a 95%
confidence interval for the mean (xbar - 1.96s/sqrt(n), xbar + 1.96s/sqrt(n)).
This is a range of values that are reasonable estimates of what the true
population elevation or relative humidity mean is. Note that the differences
between the two 95% confidence intervals above are the interpretation, the
sqrt(n), and the requirements needed for the interval to be valid. The
interval for one observation needs the standard deviation by itself and the
data must be perfectly normally distributed. The interval for the mean of all
observations needs the standard deviation divided by the square root of the
sample size and EITHER the data must be perfectly normally distributed OR the
sample size must be large (n >= 30).
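
The variance, standard deviation, and the two intervals described above can be
sketched in Python as follows. The function names are assumptions made for
this example, and 1.96 is the normal-distribution multiplier quoted in the
text.

```python
from math import sqrt
from statistics import mean


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    xbar = mean(values)
    squared_deviations = [(x - xbar) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)


def sample_standard_deviation(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))


def interval_for_one_observation(values):
    """(xbar - 1.96s, xbar + 1.96s); needs normally distributed data."""
    xbar, s = mean(values), sample_standard_deviation(values)
    return (xbar - 1.96 * s, xbar + 1.96 * s)


def interval_for_mean(values):
    """(xbar - 1.96s/sqrt(n), xbar + 1.96s/sqrt(n)); needs normal data or large n."""
    xbar, s = mean(values), sample_standard_deviation(values)
    margin = 1.96 * s / sqrt(len(values))
    return (xbar - margin, xbar + margin)
```

Because of the division by sqrt(n), the interval for the mean is always
narrower than the interval for one observation when n is greater than 1.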

In your project, you will have two variables to calculate mean, median, mode,
variance, and standard deviation for. It is easiest to use X and Y (e.g., "x
bar" and "y bar") to distinguish them. I would choose elevation as the X and
relative humidity as the Y because elevation is fixed at any particular place
on earth and relative humidity changes there. It seems to make more sense to
use elevation (X) to predict relative humidity (Y). However, if you were to
measure atmospheric pressure instead of elevation, then pressure could be the
X or the Y because pressure changes at fixed points on earth.

All of the statistics above are called "univariate statistics" because they
all describe one variable at a time. You will also need to calculate some
"bivariate statistics" because you will need to know how two variables
(relative humidity and elevation) relate to each other.

One bivariate statistic that measures how your relative humidity and elevation
vary together is the "covariance". Statisticians usually label covariance in
one of two ways, either cov(x,y) for "covariance of x and y" or with an "s"
and a subscript "xy". (Sometimes the superscript "2" is added and sometimes
not, depending on the book.) To compute the covariance, multiply each x
deviation by its corresponding y deviation for all observations, sum the
products, then divide by (n-1). Variance and standard deviation are always
positive, but covariance can be negative. Covariance is positive if both
variables increase or decrease together. It is negative if one variable
increases while the other decreases. Covariance is hard to interpret because
it is measured in weird units - it is the product of the two original units,
in your case, (%)x(meters) or (%)x(feet). Covariance is one of the functions
that Excel got wrong - it uses the wrong denominator. You'll have to adjust
Excel's function a little. Use "=covar(<your x variable>, <your y
variable>)*n/(n-1)", but replace "n" with the number that is your actual
sample size.
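
To make the denominator issue concrete, here is a minimal Python sketch that
computes the covariance both ways. The function names and the tiny data set
are assumptions for illustration; the n-denominator version plays the role of
the Excel result that needs adjusting.

```python
from statistics import mean


def sum_of_deviation_products(xs, ys):
    """Multiply each x deviation by its matching y deviation and add them up."""
    xbar, ybar = mean(xs), mean(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))


def sample_covariance(xs, ys):
    """Divide by (n - 1), as described in the text."""
    return sum_of_deviation_products(xs, ys) / (len(xs) - 1)


def covariance_with_n_denominator(xs, ys):
    """Divide by n instead; this is the version that needs the adjustment."""
    return sum_of_deviation_products(xs, ys) / len(xs)


# Made-up data where y falls as x rises, so the covariance is negative.
xs = [1, 2, 3, 4]
ys = [8, 6, 4, 2]
n = len(xs)
adjusted = covariance_with_n_denominator(xs, ys) * n / (n - 1)
```

Multiplying the n-denominator result by n/(n-1) recovers the sample
covariance, which is the same adjustment applied to the Excel formula.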

Your final step will be linear regression analysis. This is a pretty advanced
statistics method, so you may need to find a statistics teacher to help you.
You'll want to start with a scatterplot in Excel that is just like an XY plot
in algebra. Each point on your grid will be an observation with your X
variable on the horizontal (left-right) axis and your Y variable on the
vertical (up-down) axis. When you plot your points, you should see a pattern
that might be shaped like a curve or a straight line. Regression is a type of
statistical analysis that tries to put the best fitting equation through your
points. You tell the computer what type of model you want (e.g., a line or a
certain type of curve) and the computer tells you what the best equation is
for the type of model you asked for. IN YOUR IMAGINATION (you'll need to know
calculus before you have enough math knowledge to do this for real), put a
straight line through your points. Now imagine measuring the vertical
distance between each point and the line. The vertical distance is called a
"residual" or an "error" because it is the amount left over after accounting
for the line. The residuals/errors are the same idea as the deviations
described above. The only difference between a deviation and a residual/error
is that the deviations are associated with univariate statistics where the
mean stays the same for each observation, while the residuals/errors are
associated with bivariate or multivariate equations where the mean changes
from one observation to the next. In regression, the mean is actually the y
value on the line, which changes for each value of "x". Measuring the
vertical distance between any point and the line is the same as subtracting a
mean "y" from the point's "y" value. If you square the residuals/errors then
add them up you will have what is called the "sum of squared errors" or
"residual sum of squares" or "residual SS" because it is the sum of the
squared vertical distances between your line and your points.
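
Here is a minimal Python sketch of residuals and the residual sum of squares
for any straight line, plus the standard closed-form least-squares line for
one X and one Y. The function names are assumptions for this example, not
something Excel exposes under these names.

```python
from statistics import mean


def residuals(xs, ys, slope, intercept):
    """Vertical distance from each point to the line y = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def residual_sum_of_squares(xs, ys, slope, intercept):
    """Square each residual, then add them up."""
    return sum(e ** 2 for e in residuals(xs, ys, slope, intercept))


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line with the smallest residual SS."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept
```

Trying a hand-drawn guess for the line and comparing its residual SS with the
least-squares line shows that the guess can match but never beat it.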

Regression computer software finds the equation of the best fitting line to
your points by minimizing the residual sum of squares, while assuming the
residuals will add to 0. In other words, the computer finds the line that has
the smallest vertical spread of points around it, also called the smallest
variation around the line. The computer also calculates a "test statistic"
called the F statistic, its p-value, and a bunch of other measures. Here is
where a lot of people make mistakes. Many people start interpreting the
numbers as soon as the computer calculates them, which is WRONG. The model
equation you forced the computer to calculate might be the wrong model. For
example, maybe you need a curved model instead of the imaginary line you drew.
You have to check to see whether your model fits. The best way to do this is
to make three different types of XY plots of the residuals.

Plot observation number on the horizontal axis and residual on the vertical
axis. The residuals should be independent from each other, so you should see
no pattern in your plot (points should be randomly scattered).

Next, look at the "residuals vs fits" plot. The y values from your line
(called the "fitted y values" or "predicted y values") should be on the
horizontal axis and the residuals should be on the vertical axis. You should
see no pattern (points should be randomly scattered). If the plot has a
funnel shape, where the vertical spread of points is small on the left and
large on the right (or vice versa) then your variance is not constant. If
that happens, you should convert your X and Y variables to natural logs, ln(X)
and ln(Y), and try fitting a line to ln(X) and ln(Y). In Excel, the function
is "=ln(<your x variable>)". If the residuals vs fits plot has a curved
shape, then the straight model you started with is not right and you should
try fitting a curved equation instead of a straight line.

Finally, look at a normal probability plot of the residuals. (Note that this
is another thing Excel got wrong. The plot that Excel calls a "Normal
probability plot" is actually a "Normal cumulative distribution plot", which
is very different. You have to do several things manually to get the correct
plot in Excel.) The real normal probability plot has the estimated quantiles
(also called rankits) on the horizontal axis and the residuals on the vertical
axis. The estimated quantiles are normally distributed. The residuals should
be normally distributed. If the residuals are normal, then your plot will
look like a straight line (like y=x, or normal=normal). If the residuals are
not normal, you will see a notable curve in the plot.
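
The points for the three residual plots can be prepared with a short Python
sketch like the one below. The function names are assumptions, and the
estimated quantiles use Blom's approximation (i - 0.375)/(n + 0.25), which is
one common choice for rankits and not necessarily the one any particular
software package uses.

```python
from statistics import NormalDist


def order_plot_points(residuals):
    """(observation number, residual) pairs for the first plot."""
    return [(i + 1, r) for i, r in enumerate(residuals)]


def residuals_vs_fits_points(fitted, residuals):
    """(fitted y value, residual) pairs for the second plot."""
    return list(zip(fitted, residuals))


def normal_probability_points(residuals):
    """(estimated normal quantile, residual) pairs for the third plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()
    points = []
    for i, r in enumerate(ordered, start=1):
        p = (i - 0.375) / (n + 0.25)   # Blom's approximation (an assumption)
        points.append((standard_normal.inv_cdf(p), r))
    return points
```

Each list can be pasted into two spreadsheet columns and drawn as an XY
scatter. For the third plot, points close to a straight line suggest the
residuals are approximately normal.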

Once you find a model that makes the three plots above look good, then you can
start interpreting the regression numbers that Excel calculated for you. The
F statistic's p-value compares your model with X to a flat, horizontal line
without X. If the p-value is very small (< .05), then your model is better
than a flat line and X and Y are related to each other. (Note that Excel
prints very small numbers in scientific notation so 4.2E-05 means 4.2x10^(-5),
or 4.2/(10x10x10x10x10), which is much less than .05!) If the p-value is very
small and your X and Y are related, then the model equation is created by
multiplying each of the "Coefficients" times its variable name and adding
them. Next, "r squared" is what statisticians call the "coefficient of
determination" and it is an estimate of the variation in Y that is explained
by your model equation. If r squared = .85, then about 85% of the variation
in Y is explained by your model equation. Finally, "r" is what statisticians
call the "Pearson correlation coefficient". It is a measure of the strength
of your model equation, or how tightly your points are spread around your
equation. In simple regressions where you have one X and one Y, the r is
supposed to be the same sign (positive or negative) as the covariance and the
slope. Excel always calculates what it calls "Multiple R" as a positive
number, so you may need to change the sign to a negative to make it the real
"r" correlation. If r = 1 or -1, then your XY points fall perfectly on your
model equation, there is no variation, and X and Y are perfectly related to
each other. If r = 0, then your points show no straight-line pattern and X and
Y are not linearly related.
The larger the magnitude of r, the better the equation and the tighter the
points are spread around the model.
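
As a hedged illustration of how these numbers fit together for one X and one
Y, here is a Python sketch that computes r squared, the always-positive
"Multiple R", the signed r, and the F statistic. The function name and
dictionary keys are assumptions for this example. The F statistic's p-value
is left out because it needs the F distribution, which is not in Python's
standard library.

```python
from math import sqrt
from statistics import mean


def regression_summary(xs, ys):
    """Summary numbers for a straight-line fit with one X and one Y."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    total_ss = sum((y - ybar) ** 2 for y in ys)
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Share of the variation in Y explained by the line (clamped at 0).
    r_squared = max(0.0, 1 - residual_ss / total_ss)
    multiple_r = sqrt(r_squared)                        # never negative
    r = multiple_r if slope >= 0 else -multiple_r       # same sign as the slope
    if residual_ss == 0:
        f_statistic = float("inf")                      # perfect fit
    else:
        f_statistic = (total_ss - residual_ss) / (residual_ss / (n - 2))
    return {"slope": slope, "intercept": intercept, "r_squared": r_squared,
            "multiple_r": multiple_r, "r": r, "f_statistic": f_statistic}


# Scientific notation as printed by Excel: 4.2E-05 is far below .05.
tiny_p_value = float("4.2E-05")
```

With a downward-sloping data set, multiple_r comes out positive while r is
negative, which mirrors the sign change described above.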

That's everything you would need to do to have a truly outstanding data
analysis! It's a great project idea and it's set up perfectly for some really
nice statistics. If you can figure out how to do all of the statistics I
described, then you should do fairly well in your science fair judging! Good
luck and ask for help if you get stuck!

END