GEM-SET : Girls' E-Mentoring Program : Science | Engineering | Technology

Daily Digest Archive for September 23, 2004

Q: (Initially posted September 20, 2004) FROM STUDENT MEMBER JOSIE K. IN GA
I am doing a science fair experiment on the correlation between relative
humidity and elevation. Does anyone have any suggestions on the statistical
analysis that I will have to make? I know I have to do standard deviation
and chi-squared, however I am not sure where to go from there (I have not
taken statistics yet) but I would be totally open to any suggestions. Thanks
again in advance!!

September 23, 2004

A: FROM MENTOR DENISE HARBERT IN IL
Very awesome question Josie!!! I think this might be the first very
computational statistical question that has appeared on the Daily Digest! I
am a statistician and am very excited to help!

************
I DID NOT FINISH MY ANSWER TODAY BECAUSE I NEED YOU TO ANSWER 3 QUESTIONS FOR
ME FIRST - READ BELOW!!!
************

Let me start by helping you a little with your terminology. Research projects
are usually described as either "studies" or "experiments". (Incidentally,
there are other categories, such as participatory action research, but most
research is done in a study or experiment.) Although I can't tell for sure
from your question, I think you are doing a study. Studies are where the
researcher goes out into the world and takes measurements of something that
already exists in nature, then tries to describe what's happening. A
meteorologist who goes out into the weather each day and measures dew point,
relative humidity, temperature, pressure, etc. (these are called "variables")
is conducting a study. An experiment is where the researcher changes
something in the natural environment, takes measurements, then determines if
the results were different than they would have been without the change. If
there are differences, then the researcher can conclude that the differences
were caused by the thing that the researcher changed. A meteorologist working
in a weather lab creates fake weather inside a sealed containment unit,
changes the temperature inside the unit but nothing else, then measures the
pressure, dew point, etc. If any variable other than temperature changed,
then the meteorologist can conclude that the change in temperature caused the
change in the other variable(s).

Studies and experiments differ mainly in what their purpose is, so the
conclusions that can be drawn based on the results of the research differ as
well. In short, studies help researchers DESCRIBE relationships between
variables, but experiments help researchers determine whether one variable is
CAUSING the others to be what they are. Depending on the experimental design,
experiments might also describe the relationship between variables. However,
studies never determine cause. They might suggest cause, but they do not
determine whether cause exists. For example, profits from ice cream sales in
the Midwest U.S. are related to the number of drowning deaths. Do ice cream
sales cause drownings? Do drownings cause people to eat ice cream? Or does
the warm or cold weather cause people to swim or not swim, eat ice cream or
not eat ice cream? This is a difficult concept for a lot of really brilliant
researchers to understand, so don't feel bad if you don't get it right away.
(See my answer to the question "What is calculus? What is statistics?" in the
March 12, 2002 Daily Digest at http://www.uic.edu/orgs/gem-set/march.htm for a
more thorough explanation. In the interrupting research example, the first
researcher did a study to confirm the interrupting relationship existed and
the second researcher did an experiment to determine the cause of the
interrupting.)
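
The ice cream example can be sketched with a few lines of Python. This is
only an illustration under stated assumptions: every number below is invented,
and the helper name pearson_r is hypothetical. Both ice cream sales and
drownings are built from temperature alone, yet they still end up strongly
related to each other.

```python
import math

def pearson_r(xs, ys):
    """Strength of the straight-line relationship between two lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented monthly temperatures (degrees F), January to December. Not real data.
temperature = [25, 30, 40, 52, 63, 73, 77, 75, 67, 55, 41, 29]

# Small made-up "wiggles" so the numbers are not perfectly tidy.
sales_wiggle = [3, -2, 1, -1, 2, -3, 1, 2, -2, 0, 3, -1]
drowning_wiggle = [0, -1, 0, 1, -1, 0, 1, 0, 1, -1, 0, 0]

# Each quantity depends on temperature ONLY. Neither formula uses the other.
ice_cream_sales = [100 + 10 * t + 5 * w
                   for t, w in zip(temperature, sales_wiggle)]
drownings = [t // 5 - 4 + w
             for t, w in zip(temperature, drowning_wiggle)]

r_sales_drownings = pearson_r(ice_cream_sales, drownings)
print(r_sales_drownings)  # close to 1, even though neither causes the other
```

A study that only recorded sales and drownings would see the strong
relationship but could not tell that temperature was behind both.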

"Correlation" is another word to be careful with in statistics because it
means something entirely different to statisticians than it does to
non-statisticians. Your question should have said "relationship" instead of
"correlation". Pearson's correlation coefficient (correlation for short, also
called "r") is a very specific "test statistic" that is computed with a
mathematical formula. Chi-squared is also a "test statistic", although it has
a different equation and is used with completely different types of data. The
correlation measures the strength of the linear (straight-line) relationship
between two variables. I'll explain more on that later...
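
For readers curious about the "mathematical formula" behind r, here is a
minimal sketch in Python. It is not a recommendation of which test to use.
The function name pearson_r and the variable names are assumptions made for
this example, and the elevation and humidity numbers are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up numbers for illustration only (not real measurements)
elevation_m = [0, 500, 1000, 1500, 2000]
humidity_pct = [80, 72, 65, 61, 50]

r = pearson_r(elevation_m, humidity_pct)
print(r)  # negative here: humidity falls as elevation rises in this fake data

# A perfectly straight-line relationship gives r = 1
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
print(r_perfect)
```

The result always lies between -1 and 1. Values near -1 or 1 mean the points
fall close to a straight line, and values near 0 mean they do not.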

As a statistician, I always advise researchers that it is CRUCIAL to figure
out how they're going to analyze their data BEFORE they design a study or
experiment. This seems counter-intuitive and backwards, but it's critical to
making sure data get collected in such a way that statistical analysis can be
done. I have seen and heard about many expensive research projects that have
been a complete waste of time and money because the data were collected in
such a way that it is impossible to do any meaningful statistical analysis.
Fortunately, it seems like your project is automatically well designed for a
standard statistical test (unlike some of the professional research failures
I've seen)!

The first thing you need to figure out when thinking of statistical analysis
is what kind of variables you are measuring. Are they binary? categorical?
discrete? continuous? somewhere in the middle?

Binary variables have two possible answers. Any variable that can only be
"Yes" or "No" is binary. So is "Up" or "Down" and "Pass" or "Fail".
(Incidentally, computers store everything in binary code, which uses only
"0" and "1". Most programmers today write in higher-level languages that get
translated into binary code.)

Categorical variables have three or more possible answers that cannot be
arranged from low to high in any meaningful order. Race is a categorical
variable. So is college major, favorite color, phone number, zip code,
favorite sport, extracurricular activities, etc. Some categorical variables
have numbers assigned to them (e.g., area codes, zip codes), but the numbers
are not meaningful if you average them or put them in order from smallest to
largest. Thinking about whether an average makes sense is a quick way to tell
if a numbered variable is actually categorical.
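
The "does an average make sense?" check can be tried out in a few lines of
Python. This is just a sketch: the zip codes and heights below are arbitrary
example values chosen for illustration.

```python
from statistics import mean

# Arbitrary example values, chosen only for illustration
zip_codes = [60607, 30301, 10001]
heights_cm = [150, 160, 170]

mean_zip = mean(zip_codes)       # a number, but it does not describe a place
mean_height = mean(heights_cm)   # a sensible "typical" height

print(mean_zip)
print(mean_height)
```

The computer will happily average the zip codes, but the answer means
nothing, which is the clue that zip code is categorical.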

Discrete variables have three or more possible answers that can be
meaningfully ordered from low to high, but they have gaps in between them and
they may not be evenly spaced. A variable whose answer can be "None", "Some",
"Most", or "All" is discrete. So is "1", "2", "3", "4 or more" and "Small",
"Medium", "Large". So is "Strongly Disagree", "Disagree", "Neutral",
"Agree", "Strongly Agree". (Incidentally, this last one is called a Likert
scale.)

Continuous variables are those that can be put in order from low to high and
have no gaps in between. In theory, they could be measured to any number of
decimal places if you had the proper measurement tools. Height, weight,
time, and distance are all continuous.

Some variables are actually discrete, but there are so many possible answers
that they can be considered continuous for data analysis purposes. Money is
one example. There is nothing smaller than a penny, so money is discrete.
But think of how many amounts exist between $0 and $100 (that's about 10,000
possible answers). "How old are you?" is realistically discrete because our
society understands that to mean age in years, rounded down to the nearest
whole number. However, in some cases it may make sense to analyze age as
continuous.
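
The money and age examples can be checked with a short Python sketch. The
exact age used below is hypothetical.

```python
import math

# Every amount from $0.00 to $100.00, counted in whole cents
amounts_in_cents = range(0, 100 * 100 + 1)
number_of_amounts = len(amounts_in_cents)
print(number_of_amounts)  # 10001 when both $0.00 and $100.00 are counted

# "How old are you?" usually means age rounded DOWN to a whole year
exact_age_years = 15.75  # hypothetical exact age
stated_age = math.floor(exact_age_years)
print(stated_age)  # 15
```

With roughly ten thousand possible values, treating money as continuous
loses very little, while age in whole years has far fewer values.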

Here are some questions for you:
1. Is relative humidity binary, categorical, discrete, or continuous?
2. Is elevation binary, categorical, discrete, or continuous?
3. Do you have access to some kind of computer software for statistical data
analysis? If so, what is it called? If not, then do you have access to
Microsoft Excel?

I have coincidentally run out of time! But I'll make a deal with you. If you
write to GEM-SET@uic.edu and answer the 3 questions above, then I will give
you information about the type of statistical test that would be best to use
for relative humidity and elevation!!

 
