September 23, 2004
A: FROM MENTOR DENISE HARBERT IN IL
Very awesome question, Josie!!! I think this might be the first very
computational statistical question that has appeared on the Daily Digest!
I am a statistician and am very excited to help!
************
I DID NOT FINISH MY ANSWER TODAY BECAUSE I NEED YOU TO ANSWER 3 QUESTIONS
FOR ME FIRST - READ BELOW!!!
************
Let me start by helping you a little with your terminology. Research
projects are usually described as either "studies" or "experiments".
(Incidentally, there are other categories, such as participatory action
research, but most research is done in a study or experiment.) Although I
can't tell for sure from your question, I think you are doing a study.
Studies are where the researcher goes out into the world and takes
measurements of something that already exists in nature, then tries to
describe what's happening. A meteorologist who goes out into the weather
each day and measures dew point, relative humidity, temperature, pressure,
etc. (these are called "variables") is conducting a study. An experiment is
where the researcher changes something in the natural environment, takes
measurements, then determines if the results were different than they would
have been without the change. If there are differences, then the researcher
can conclude that the differences were caused by the thing that the
researcher changed. A meteorologist working in a weather lab creates fake
weather inside a sealed containment unit, changes the temperature inside
the unit but nothing else, then measures the pressure, dew point, etc. If
any variable other than temperature changed, then the meteorologist can
conclude that the change in temperature caused the change in the other
variable(s).
Studies and experiments differ mainly in their purpose, so the conclusions
that can be drawn from the results of the research differ as well. In
short, studies help researchers DESCRIBE relationships between variables,
but experiments help researchers determine whether one variable is CAUSING
the others to be what they are. Depending on the experimental design,
experiments might also describe the relationship between variables.
However, studies never determine cause. They might suggest cause, but they
do not determine whether cause exists. For example, profits from ice cream
sales in the Midwest U.S. are related to the number of drowning deaths. Do
ice cream sales cause drownings? Do drownings cause people to eat ice
cream? Or does the warm or cold weather cause people to swim or not swim,
eat ice cream or not eat ice cream? This is a difficult concept for a lot
of really brilliant researchers to understand, so don't feel bad if you
don't get it right away. (See my answer to the question "What is calculus?
What is statistics?" in the March 12, 2002 Daily Digest at
http://www.uic.edu/orgs/gem-set/march.htm for a more thorough explanation.
In the interrupting research example there, the first researcher did a
study to confirm the interrupting relationship existed and the second
researcher did an experiment to determine the cause of the interrupting.)
"Correlation" is another word to be careful with in statistics because it
means something entirely different to statisticians than it does to
non-statisticians. Your question should have said "relationship" instead of
correlation. Pearson's correlation coefficient (correlation for short, also
called "r") is a very specific "test statistic" that is computed with a
mathematical formula. Chi-squared is also a "test statistic", although it
has a different equation and is used with completely different types of
data. The correlation represents the strength of the linear (straight-line)
relationship between two variables. I'll explain more on that later...
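
To make the "mathematical formula" a little less mysterious, here is a
rough Python sketch of how r can be computed by hand. The function name
pearson_r and all of the numbers are made up for illustration (they are
not real measurements), and this is only a sketch, not the output of any
particular statistics package.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (2+ values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables move together around their averages.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable never changes")
    return sxy / math.sqrt(sxx * syy)

# Made-up example data: humidity drops steadily as elevation rises.
elevation_m = [100, 200, 300, 400, 500]
humidity_pct = [80, 74, 71, 65, 60]
r_line = pearson_r(elevation_m, humidity_pct)   # close to -1

# A perfect curve (y = x squared) is a strong relationship, but it is
# not a straight line, so r comes out as 0.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_curve = pearson_r(xs, ys)

print(round(r_line, 3), round(r_curve, 3))
```

The second example is the reason "correlation" and "relationship" are not
the same word: r only picks up straight-line patterns.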
As a statistician, I always advise researchers that it is CRUCIAL to figure
out how they're going to analyze their data BEFORE they design a study or
experiment. This seems counter-intuitive and backwards, but it's critical
to making sure data get collected in such a way that statistical analysis
can be done. I have seen and heard about many expensive research projects
that have been a complete waste of time and money because the data were
collected in such a way that it is impossible to do any meaningful
statistical analysis. Fortunately, it seems like your project is
automatically well designed for a standard statistical test (unlike some of
the professional research failures I've seen)!
The first thing you need to figure out when thinking of statistical
analysis is what kind of variables you are measuring. Are they binary?
categorical? discrete? continuous? somewhere in the middle?
Binary variables have two possible answers. Any variable that can only be
"Yes" or "No" is binary. So is "Up" or "Down" and "Pass" or "Fail".
(Incidentally, computer software engineers initially based all of their
foundation programs on binary code which used a "0" or "1". Most
programmers today write code in non-binary languages that are based on
binary code.)
Categorical variables have three or more possible answers that cannot be
arranged from low to high in any meaningful order. Race is a categorical
variable. So is college major, favorite color, phone number, zip code,
favorite sport, extracurricular activities, etc. Some categorical variables
have numbers assigned to them (e.g., area codes, zip codes), but the
numbers are not meaningful if you average them or put them in order from
smallest to largest. Thinking about whether an average makes sense is a
quick way to tell if a numbered variable is actually categorical.
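
Here is a small Python sketch of that "does an average make sense?" check.
The zip codes below are a made-up sample, and the variable names are just
for illustration.

```python
from collections import Counter
from statistics import mean

# Made-up sample of zip codes collected from four people.
zip_codes = [60607, 60607, 60614, 62701]

# The computer will happily average them, but the result is not a
# zip code anyone in the sample actually gave.
avg_zip = mean(zip_codes)

# For a categorical variable, counting how often each answer appears
# is the kind of summary that does make sense.
most_common_zip, how_many = Counter(zip_codes).most_common(1)[0]

print(avg_zip, most_common_zip, how_many)
```

The average is a number that is not one of the answers and does not
describe anybody, which is the hint that zip code is categorical even
though it is written with digits.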
Discrete variables have three or more possible answers that can be
meaningfully ordered from low to high, but they have gaps in between them
and they may not be evenly spaced. A variable whose answer can be "None",
"Some", "Most", or "All" is discrete. So is "1", "2", "3", "4 or more" and
"Small", "Medium", "Large". So is "Strongly Disagree", "Disagree",
"Neutral", "Agree", "Strongly Agree". (Incidentally, this last one is
called a Likert scale, and many textbooks call ordered answers like these
"ordinal".)
Continuous variables are those that can be put in order from low to high
and have no gaps in between. In theory, they could be measured to as many
decimal places as you like if you had the proper measurement tools. Height,
weight, time, and distance are all continuous.
Some variables are actually discrete, but there are so many possible
answers that they can be considered continuous for data analysis purposes.
Money is one example. There is nothing smaller than a penny, so money is
discrete. But think of how many amounts exist between $0 and $100 (that's
about 10,000 possible answers). "How old are you?" is realistically
discrete because our society understands that to mean age in years, rounded
down to the nearest whole number. However, in some cases it may make sense
to analyze age as continuous.
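
If you would like to check the money and age examples yourself, here is a
short Python sketch. The age of 15.9 years is a made-up value, and the
variable names are only for illustration.

```python
import math

# Every amount from $0.00 to $100.00, counted in whole pennies.
# Counting both endpoints gives one more than 100 * 100.
amounts_in_cents = range(0, 100 * 100 + 1)
n_amounts = len(amounts_in_cents)

# Age as people usually report it: whole years, rounded down.
exact_age_years = 15.9   # hypothetical person
reported_age = math.floor(exact_age_years)

print(n_amounts, reported_age)
```

With that many possible answers packed so closely together, treating money
as continuous loses very little, while the reported age throws away
everything after the decimal point.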
Here are some questions for you:
1. Is relative humidity binary, categorical, discrete, or continuous?
2. Is elevation binary, categorical, discrete, or continuous?
3. Do you have access to some kind of computer software for statistical
   data analysis? If so, what is it called? If not, then do you have
   access to Microsoft Excel?
I have coincidentally run out of time! But I'll make a deal with you. If
you write to GEM-SET@uic.edu and answer the 3 questions above, then I will
give you information about the type of statistical test that would be best
to use for relative humidity and elevation!!