Saturday, October 25, 2014

Test your understanding: Measures of Central Tendency and Variation

Understanding measures of central tendency and variation (or dispersion) is key to analyzing quantitative data because these measures enable us to summarize large quantities of data.

Imagine you are asked to describe this data:  0, 1, 2, 3, 4, 5, 6, 7, 8, 9

What can you say about this data?  How can it be summarized?  By the end of this post, you’ll be able to answer these questions in several ways.

Measures of Central Tendency: measures of tendency tell us what our values tend to be on average.  There are three: 1) mode, 2) median, and 3) mean.

The mode is the most frequently occurring case.  It is the most general and least precise measure of tendency; it can be used with any level of measurement.  A sample of smokers reveals that 10 smoke Camels, 15 smoke du Maurier, and 20 smoke Dunhill.  The mode is Dunhill (not 20) because it occurs most frequently.  If 10 students scored 60 on the midterm, 15 scored 70, and 5 scored 80, then the mode is 70.
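Here is a minimal Python sketch of the midterm example, in case it helps to see the counting spelled out.  The variable names are my own; the only data used are the scores given above.  Note that the mode is the score itself, not how many times it occurs.

```python
from collections import Counter
from statistics import mode

# Midterm example: 10 students scored 60, 15 scored 70, and 5 scored 80.
scores = [60] * 10 + [70] * 15 + [80] * 5

counts = Counter(scores)  # how often each score occurs
most_common_score, frequency = counts.most_common(1)[0]

print(most_common_score)  # 70, the mode
print(frequency)          # 15, the number of times the mode occurs
print(mode(scores))       # statistics.mode gives the same answer: 70
```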

The median is the midpoint in a data set.  Once you have rank ordered the cases, the middle value is the median.  It can only be determined for ordinal and interval/ratio levels (nominal data cannot be rank ordered).  For example, 5 is the median of these data: 3,3,4,4,5,6,6,7,7 (if there were an even number of values, then the median would be the average of the middle two).
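The same steps can be sketched in Python.  The function name middle_value is my own, and the eight-value list is simply there to show the even-numbered case; Python's built-in statistics.median does the same job.

```python
from statistics import median

def middle_value(values):
    """Rank order the values, then return the middle one (or the average of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd number of values: a single middle case
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the middle two

odd_data = [3, 3, 4, 4, 5, 6, 6, 7, 7]   # nine values, as in the example above
even_data = [1, 2, 3, 4, 5, 6, 7, 8]     # eight values, to show the even case

print(middle_value(odd_data))   # 5
print(middle_value(even_data))  # 4.5
print(median(odd_data))         # 5, same result from the standard library
```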

The mean is the average value.  It is determined by adding up the individual values and dividing by the total number of cases.  5 is also the mean for the above data (the nine values add up to 45, and 45 divided by 9 is 5).  It can only be calculated for interval/ratio data (nominal and ordinal data lack equal distance between categories).
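For anyone who wants to check the arithmetic, here is a short Python sketch using the same nine values.  The variable names are mine.

```python
from statistics import mean

data = [3, 3, 4, 4, 5, 6, 6, 7, 7]

total = sum(data)   # add up the individual values: 45
count = len(data)   # number of cases: 9

print(total / count)  # 5.0
print(mean(data))     # the standard library computes the same value
```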

The advantage of the mean is that it’s technically the most accurate estimation of data because it takes into account every value, unlike the median, which represents only a single value.  The disadvantage is that the mean is sensitive to outliers (extreme high or low values) and can be misleading, whereas the median is largely unaffected by outliers.  As a rule of thumb, it is better to rely on the median if you know or suspect there are outliers in the data.
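To see how much a single outlier can pull the mean, here is a small Python sketch.  The extra value of 100 is a made-up outlier that I've added purely for illustration; the other nine values are the ones used above.

```python
from statistics import mean, median

data = [3, 3, 4, 4, 5, 6, 6, 7, 7]
with_outlier = data + [100]   # 100 is a hypothetical extreme value

print(mean(data), median(data))                  # both are 5
print(mean(with_outlier), median(with_outlier))  # 14.5 5.5
```

The mean nearly triples, while the median barely moves.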

Measures of tendency are easy enough to figure out, but they’re of limited use because most values vary to some degree from the mode, median, and mean.  For example, the mean of 9 and 11 is 10, but so is the mean of 0 and 20.  Here’s a better visual example:

Data set 1:  1, 1, 2, 5, 8, 9, 9
Data set 2:  2, 4, 4, 5, 6, 6, 7

Both data sets have a median of 5 and nearly the same mean (5 for the first, about 4.86 for the second).  But as you can see, there’s a lot more variation in the first set than in the second and so the mean actually better reflects the second data set.  These data sets are small and easy to grasp visually, but imagine if you were dealing with thousands of values!  Then it wouldn’t be so easy.  Fortunately we have measures of variation to tell us how spread out the data are.

Measures of Variation:
There are several measures of variation but we’re focusing on three: the range, interquartile range (IQR), and the standard deviation (SD).

The range is easy enough: it’s the span of values from lowest to highest.  The range for the first data set is 1-9 (a span of 8) and for the second is 2-7 (a span of 5).  It’s a vague measure, but nonetheless gives us some sense that the values in the first data set are more spread out than in the second.
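A quick Python sketch of the range for both data sets follows.  The helper name value_range is my own; it reports the lowest value, the highest value, and the distance between them.

```python
data_set_1 = [1, 1, 2, 5, 8, 9, 9]
data_set_2 = [2, 4, 4, 5, 6, 6, 7]

def value_range(values):
    """Return (lowest, highest, span) for a list of numbers."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

print(value_range(data_set_1))  # (1, 9, 8)
print(value_range(data_set_2))  # (2, 7, 5)
```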

The interquartile range (IQR) tells us how closely the data are dispersed around the median.  There’s a formula for calculating IQR, but the short and sweet of it is this:  First, order the data lowest to highest, next identify the median, then divide the data into four sections (or quartiles), and finally drop the first and fourth quartiles (the lowest and highest values).  What remains are the second and third quartiles separated by the median.  The IQR tells us how closely dispersed the middle half of our data are around the median.

Consider data set 3:  1, 2, 3, 4, 5, 6, 7, 8

The data are already ordered and the median is 4.5.  We drop the first quartile (1, 2) and the fourth quartile (7, 8), and we’re left with an IQR of 3-6.  Now we know our median value is 4.5 and that the middle half of the data are between 3 and 6.
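The drop-the-outer-quartiles shortcut can be sketched in Python as follows.  This is only a sketch of the shortcut described here, not the formal formula: the function name middle_half is mine, and it assumes the number of values divides evenly by four.  Library routines such as statistics.quantiles interpolate between values, so they can give slightly different cut points.

```python
def middle_half(values):
    """Order the data, then drop the lowest and highest quarters."""
    ordered = sorted(values)
    quarter = len(ordered) // 4   # size of one quartile (assumes len divisible by 4)
    return ordered[quarter:len(ordered) - quarter]

data_set_3 = [1, 2, 3, 4, 5, 6, 7, 8]

kept = middle_half(data_set_3)
low, high = kept[0], kept[-1]

print(kept)        # [3, 4, 5, 6]
print(low, high)   # 3 6, the bounds of the middle half
print(high - low)  # 3, the width of that middle half
```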

The standard deviation (SD) tells us how closely the data are dispersed around the mean.  It is the most accurate measure of variation.  There’s a formula for calculating SD that is relatively straightforward but involves several steps.  I’m only going to talk conceptually about SD, so you’ll need to refer to the text to learn how to calculate it.

As with the mean, the benefit of SD is that it incorporates all values into the calculation and therefore is more representative of the entire data set; the disadvantage is that, like the mean, SD is also sensitive to outliers.
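If you’d like to see SD at work without doing the steps by hand, Python’s statistics module will compute it.  The sketch below uses data sets 1 and 2 from earlier.  One assumption to flag: I’m using pstdev, which divides by the number of values; your text may instead use the sample version that divides by one less (statistics.stdev), which gives slightly larger numbers.

```python
from statistics import pstdev

data_set_1 = [1, 1, 2, 5, 8, 9, 9]
data_set_2 = [2, 4, 4, 5, 6, 6, 7]

# Population standard deviation (an assumption; see the note above).
sd_1 = pstdev(data_set_1)
sd_2 = pstdev(data_set_2)

print(round(sd_1, 2))  # 3.42
print(round(sd_2, 2))  # 1.55
```

The larger SD for data set 1 matches what the eye already told us: its values sit farther from the mean.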

Let’s say that our mean value is 20 and that we know a single value in our data, 17.  How close is this value to the mean compared to the rest of the values in the data set?  17 seems close to 20, but if most of our data points are 18 and 19, then 17 is actually far from the mean compared to the majority.  Fortunately, SD comes to our rescue!

If SD=5, then we know that the values in our data set typically fall within about 5 points of the mean.  In this case, we would conclude that a value of 17 is close to the mean of 20.  By contrast, if SD=1 then we know that values typically fall within about 1 point of the mean.  In this case, we’d conclude that 17 is far from the mean and therefore not representative of most values.
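One way to make this comparison concrete is to express the distance from the mean in SD units.  Here is a minimal Python sketch; the function name sd_units is my own.

```python
def sd_units(value, mean, sd):
    """How many standard deviations a value lies from the mean."""
    return abs(value - mean) / sd

print(sd_units(17, 20, 5))  # 0.6, well within one SD of the mean
print(sd_units(17, 20, 1))  # 3.0, three SDs away from the mean
```

The same 3-point gap is small when SD=5 and large when SD=1.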

On a final note, SD is related to the idea of the “normal” or bell curve, where the mean is at the center of the distribution and the SD represents sections of data to either side.  In a normal distribution, nearly all of the data (about 99.7%) are located within three SDs above and below the mean.  For example, if SD=5 with a mean of 20, then we know that the first SD is between 15 and 25, the second SD is between 10 and 30, and the third is between 5 and 35.  In this case, a value of 17 would be within 1 SD of the mean whereas a value of 7 would only be within the third SD.  An “outlier” is commonly defined as any value beyond three SDs of the mean.
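The SD bands in this example can be worked out with a few lines of Python.  This is just a sketch of the arithmetic above: the function name sd_band is mine, and the value of 40 is a hypothetical data point I’ve added to show what falls outside three SDs.

```python
import math

mean, sd = 20, 5   # the values used in the example above

# Ranges covered by 1, 2, and 3 SDs on either side of the mean.
bands = [(mean - k * sd, mean + k * sd) for k in (1, 2, 3)]
print(bands)  # [(15, 25), (10, 30), (5, 35)]

def sd_band(value, mean, sd):
    """Smallest whole number of SDs (at least 1) that contains the value."""
    return max(1, math.ceil(abs(value - mean) / sd))

print(sd_band(17, mean, sd))      # 1
print(sd_band(7, mean, sd))       # 3
print(sd_band(40, mean, sd) > 3)  # True: 40 (hypothetical) lies beyond three SDs
```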