Test your understanding: Measures of Central Tendency and Variation
Understanding measures of central tendency and variation (or
dispersion) is key to analyzing quantitative data because they enable us to
summarize large quantities of data.
Imagine you are asked to describe this data: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
What can you say about this data? How can it be summarized? By the end of this post, you’ll be able to
answer these questions in several ways.
Measures of Central
Tendency: measures of tendency tell us what our values tend to be on
average. There are three: 1) mode, 2) median, and 3) mean.
The mode is the
most frequently occurring case. It is
the most general and least precise measure of tendency; it can be used with any
level of measurement. A sample of smokers
reveals that 10 smoke Camels, 15 smoke du Maurier, and 20 smoke Dunhill. The mode is Dunhill (not 20) because it
occurs most frequently. If 10 students
scored 60 on the midterm, 15 scored 70, and 5 scored 80, then the mode is 70.
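Here is a minimal sketch of both examples in Python, using the standard library’s collections.Counter. The variable names are my own, and the Dunhill count of 20 is taken from the “(not 20)” note above.

```python
from collections import Counter

# Brand counts from the smoker example (a nominal variable).
brands = Counter({"Camels": 10, "du Maurier": 15, "Dunhill": 20})

# Midterm example: 10 students scored 60, 15 scored 70, 5 scored 80.
scores = [60] * 10 + [70] * 15 + [80] * 5

# most_common(1) returns a one-item list holding the (value, count) pair
# that occurs most often.
brand_mode, brand_count = brands.most_common(1)[0]
score_mode, score_count = Counter(scores).most_common(1)[0]

print(brand_mode, brand_count)  # Dunhill 20
print(score_mode, score_count)  # 70 15
```

Note that the mode is the category (Dunhill, 70), not the count attached to it (20, 15).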
The median is the
midpoint in a data set. Once you have
rank ordered the cases, the middle value is the median. It can only be determined for ordinal and
interval/ratio levels (nominal data cannot be rank ordered). For example, 5 is the median of these data:
3, 3, 4, 4, 5, 6, 6, 7, 7 (if there were an even number of values, then the median
would be the average of the middle two).
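As a quick sketch, the median can be checked with Python’s statistics module. The second list is a hypothetical one I made by dropping the last value, just to show the even-numbered case.

```python
import statistics

odd_data = [3, 3, 4, 4, 5, 6, 6, 7, 7]   # nine values, from the example above
even_data = [3, 3, 4, 4, 5, 6, 6, 7]     # hypothetical: eight values

# With an odd number of values, the median is the middle one.
odd_median = statistics.median(odd_data)

# With an even number, it is the average of the middle two (4 and 5 here).
even_median = statistics.median(even_data)

print(odd_median)   # 5
print(even_median)  # 4.5
```

statistics.median sorts the values itself, so the input doesn’t have to be rank ordered first.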
The mean is the
average value. It is determined by
adding up the individual cases and dividing by the number of cases. 5 is also the mean for the above data (45 divided by 9). It can only be calculated for interval/ratio
data (nominal and ordinal data lack equal distance between categories).
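The same arithmetic can be sketched in Python, first by hand and then with the standard library. The variable names are only for illustration.

```python
import statistics

data = [3, 3, 4, 4, 5, 6, 6, 7, 7]

# By hand: add up the cases and divide by the number of cases.
total = sum(data)   # 45
count = len(data)   # 9
mean_by_hand = total / count

# The standard library arrives at the same value.
mean_from_library = statistics.mean(data)

print(mean_by_hand)  # 5.0
```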
The advantage of the mean is that it takes every value into account, unlike
the median, which reflects only the middle position in the ordered data.
The disadvantage is that the mean is sensitive to outliers (extreme high
or low values) and can be misleading, whereas the median is largely unaffected by
outliers. As a rule of thumb, it is
better to rely on the median if you know or suspect outliers in the data.
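To see the effect of an outlier, here is a small sketch. The second list is hypothetical: I’ve replaced the largest value in the earlier example with an extreme one (70).

```python
import statistics

data = [3, 3, 4, 4, 5, 6, 6, 7, 7]
# Hypothetical variation: the last 7 is replaced by an extreme value.
with_outlier = [3, 3, 4, 4, 5, 6, 6, 7, 70]

mean_before = statistics.mean(data)                   # 5
median_before = statistics.median(data)               # 5

mean_after = statistics.mean(with_outlier)            # 108 / 9 = 12
median_after = statistics.median(with_outlier)        # still 5

print(mean_before, median_before)
print(mean_after, median_after)
```

One extreme value more than doubles the mean, while the median doesn’t move at all.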
Measures of tendency are easy enough to figure out, but
they’re of limited use because most values vary to some degree from the mode,
median, and mean. For example, the mean
of 9 and 11 is 10, but so is the mean of 0 and 20. Here’s a better visual example:
Data set 1: 1, 1, 2,
5, 8, 9, 9
Data set 2: 2, 4, 4,
5, 6, 6, 7
5 is the median of both data sets, and the means are close too: exactly 5 for
the first and about 4.9 (34 divided by 7) for the second. But as you can see, there’s a lot more
variation in the first set than in the second, and so the mean better
reflects the second data set. These data
sets are small and easy to grasp visually, but imagine if you were dealing with
thousands of values! Then it wouldn’t be
so easy. Fortunately, we have measures of
variation to tell us how spread out the data are.
Measures of Variation:
There are several measures of variation but we’re focusing
on three: the range, interquartile range (IQR), and the standard deviation
(SD).
The range is easy
enough: it’s the span of values from lowest to highest. The range for the first data set is 1-9 and
the second is 2-7 (expressed as single numbers, 9 - 1 = 8 and 7 - 2 = 5). It’s a crude measure,
but it nonetheless gives us some sense that the values in the first data set are
more spread out than in the second.
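A minimal sketch of the range for the two data sets, reporting both the span and its width. The helper name value_range is my own.

```python
set1 = [1, 1, 2, 5, 8, 9, 9]
set2 = [2, 4, 4, 5, 6, 6, 7]

def value_range(values):
    """Return (lowest, highest, width) for a list of numbers."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

print(value_range(set1))  # (1, 9, 8)
print(value_range(set2))  # (2, 7, 5)
```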
The interquartile
range (IQR) tells us how closely the data are dispersed around the
median. There’s a formula for
calculating IQR, but the short and sweet of it is this: first, order the data from lowest to highest; next,
identify the median; then divide the data into four sections (or quartiles);
and finally drop the first and fourth quartiles (the lowest and highest
values). What remains are the second and
third quartiles, separated by the median.
The IQR tells us how closely the middle half of our data is dispersed around the
median.
Consider data set 3:
1, 2, 3, 4, 5, 6, 7, 8
The data are ordered, and the median is 4.5. We drop the first quartile (1, 2) and the
fourth quartile (7, 8), and we’re left with an IQR of 3-6. Now we know our median value is 4.5 and that
the middle half of the data are between 3 and 6.
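The shortcut above can be sketched in Python as follows. This assumes the number of values divides evenly by four, as it does for data set 3. I’ve also added one common textbook convention for reporting the IQR as a single number (the median of the upper half minus the median of the lower half); statistical software often interpolates instead, so its figures may differ slightly.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8])
n = len(data)
quarter = n // 4  # assumes n is divisible by 4

# Drop the first and fourth quartiles; what remains is the middle half.
middle_half = data[quarter:n - quarter]
print(middle_half)  # [3, 4, 5, 6]

# One textbook convention: Q1 and Q3 are the medians of the lower and
# upper halves, and the IQR is the distance between them.
half = n // 2  # assumes an even number of values
q1 = statistics.median(data[:half])   # median of 1, 2, 3, 4
q3 = statistics.median(data[half:])   # median of 5, 6, 7, 8
iqr = q3 - q1

print(q1, q3, iqr)  # 2.5 6.5 4.0
```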
The standard deviation
(SD) tells us how closely the data are dispersed around the mean. It is the most accurate measure of
variation. There’s a formula for
calculating SD that is relatively straightforward but involves several
steps. I’m only going to talk conceptually
about SD, so you’ll need to refer to the text for learning how to calculate it.
As with the mean, the benefit of SD is that it incorporates
all values into the calculation and therefore is more representative of the
entire data set; the disadvantage is that, like the mean, SD is also sensitive
to outliers.
Let’s say that our mean value is 20 and that we know a
single value in our data, 17. How close
is this value to the mean compared to the rest of the values in the data
set? 17 seems close to 20, but if most
of our data points are 18 and 19, then 17 is actually far from the mean
compared to the majority. Fortunately, SD
comes to our rescue!
If SD=5, then we know that, roughly speaking, the values in our data
set typically fall within about 5 points of the mean. In
this case, we would conclude that a value of 17 is close to the mean of
20. By contrast, if SD=1, then values
typically fall within about 1 point of the mean. In this case, we’d conclude that 17 is far
from the mean and therefore not representative of most values.
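Without going into the formula, here is a rough sketch of the idea in Python. It computes the SD of the two data sets from earlier with the statistics module, then expresses the distance of 17 from a mean of 20 in SD units. I’m assuming the population SD (pstdev); your text may use the sample version, which gives slightly larger numbers. The helper distance_in_sds is my own name.

```python
import statistics

set1 = [1, 1, 2, 5, 8, 9, 9]
set2 = [2, 4, 4, 5, 6, 6, 7]

# Population SD (divides by n). statistics.stdev would give the
# sample SD instead (divides by n - 1).
sd1 = statistics.pstdev(set1)
sd2 = statistics.pstdev(set2)
print(round(sd1, 2), round(sd2, 2))  # 3.42 1.55

def distance_in_sds(value, mean, sd):
    """How many SDs a value lies from the mean (negative means below)."""
    return (value - mean) / sd

print(distance_in_sds(17, 20, 5))  # -0.6
print(distance_in_sds(17, 20, 1))  # -3.0
```

With SD=5, 17 sits well within one SD of the mean; with SD=1, it is a full three SDs away.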
On a final note, SD is related to the idea of the “normal”
or bell curve, where the mean is at the center of the distribution and the SD
marks off sections of data to either side.
In a normal distribution, nearly all data (about 99.7%) are located within three SDs above and below the mean. For example, if SD=5 with a mean of 20, then
we know that the first SD is between 15-25, the second SD is between 10-30, and
the third is between 5-35. In this case,
a value of 17 would be within 1 SD of the mean, whereas a value of 7 would only
be within the third SD. By a common rule of thumb, an “outlier” is
any value beyond three SDs of the mean.
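The bands in this example can be sketched in a few lines of Python. The function band_of and the test value of 40 are my own additions for illustration.

```python
mean, sd = 20, 5

# Boundaries of the first, second, and third SD bands around the mean.
bands = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

def band_of(value):
    """Return the smallest k (1 to 3) whose band contains value,
    or None if the value lies beyond three SDs."""
    for k, (low, high) in bands.items():
        if low <= value <= high:
            return k
    return None

print(bands)        # {1: (15, 25), 2: (10, 30), 3: (5, 35)}
print(band_of(17))  # 1
print(band_of(7))   # 3
print(band_of(40))  # None
```

A result of None corresponds to a value beyond three SDs of the mean.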