Chapter 11 Summary
As you know, Wednesday’s lecture was interrupted by a fire
drill. Ugh, how inconvenient. We must, however, move forward, so below I’ve
provided a conceptual synopsis of Chapter 11.
============
Up to now we’ve looked at methods of collecting quantitative
(numeric) data. Surveys, experiments,
and content analysis are all methods that generate this data. And we looked at different ways to create
measures for this type of data, namely nominal, ordinal, and ratio levels.
Now we’ve come to Chapter 11. This chapter is about making sense of the
numeric data that we collect. It teaches
us how to describe data as a whole instead of case by case. We do this all the time in our lives. For example, suppose you’re asked to describe
how good a student you are. You
wouldn’t list off every single grade you’ve ever received in school. That would be ridiculous. Instead, you’d give some kind of summary
statement, like describing yourself as an “A” or “B” student, or by giving your
overall GPA. This idea of descriptive
summary of data is the focus of Chapter 11.
First, think of data
as a collection of information about something.
If I wanted to estimate the age of Soci 2030 students, then I’d ask a bunch
of people, not just one. And that bunch
of people that I asked would provide me with a pool of information to work
with, or a “dataset.” Once we have data,
then we have to make sense of it.
Second, groups of data can be analyzed individually or they
can be compared to other groups. Univariate analysis examines a single
group of data that are represented as one variable. For instance, age is a variable that we can
analyze all by itself (e.g., the average age of students in Soci 2030). Or, we can use bivariate analysis to compare age with another variable, like
gender (e.g., the average age of women in 2030 compared to the average of
men). In this post, I’m going to deal
only with univariate analysis.
Third, we describe data in two ways: 1) measures of central tendency and 2)
measures of variation (or dispersion). Central tendency is simply what data
points “tend” to be, that is, their typical or average value.
For example, the average age of students in 2030 is 21. However, that’s just a tendency; in fact,
many students are not 21. Measures of dispersion tell us how
close most of the cases in our data are to the average. For example, most students in 2030 are within
one year of 21, plus or minus. This gives
us two related pieces of information: we know that on average 2030 students are
21 years old and we know that most of those who are not 21 are still between
20 and 22. In this example, most
cases sit very close to the average. But
this is not always the case.
Here’s a different example.
Consider the age difference between students registered at York and students
at Atkinson. York is set up on the
traditional university model and therefore students tend to be quite
young. York students are usually fresh
out of high school, live at home or in residence, receive partial or full
financial support from parents, and have limited professional work
experience. Not surprisingly, York
students tend to be about 20 years old on average, give or take two years. There’s not a lot of variation in age.
Now contrast York students with Atkinson students. Atkinson was originally set up to accommodate
mature students. Mature students are
older on average for a variety of reasons.
Maybe someone worked for a few years after high school and then decided
to go to university. Or, maybe they
worked for 20 years, realized that they couldn’t get promoted further without
more education, and are now attending night school to earn extra credits. Or, maybe they worked their whole lives,
retired, and decided to go to school and earn that degree they always dreamed
of. In short, Atkinson students tend to
be older and are much more diverse in age.
The average age is 35, with most between 25 and 45 years old. Compared to York students, Atkinson students
have a higher average age, plus there is also more variation from that average
(plus or minus 10 years).
In short, central tendency gives us a general idea about
what our data tend to be, while measures of dispersion help us understand how
much data vary from that tendency.
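If it helps to see the York and Atkinson contrast in numbers, here is a minimal Python sketch. The two age lists are invented for illustration (they are not real enrolment data), and `share_within` is a hypothetical helper I made up for this example.

```python
from statistics import mean

# Hypothetical ages, invented for illustration only.
york_ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 20]
atkinson_ages = [22, 25, 28, 30, 33, 35, 38, 41, 45, 53]

def share_within(ages, band):
    """Fraction of cases within +/- band years of the group's mean."""
    centre = mean(ages)
    close = [a for a in ages if abs(a - centre) <= band]
    return len(close) / len(ages)

for name, ages in [("York", york_ages), ("Atkinson", atkinson_ages)]:
    print(name, "| mean age:", mean(ages))
    print("  share within 2 years of the mean:", share_within(ages, 2))
    print("  share within 10 years of the mean:", share_within(ages, 10))
```

With these made-up numbers, both groups have a clear average (20 and 35), but every York age falls within two years of its average, while the Atkinson ages need a ten-year band to capture most cases.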
Fourth, now that we have the basics, we need to identify the
different measures of central tendency and variation. Measures of central tendency are the mode,
median, and mean. Measures of variation
are the range, percentiles, and standard deviation. The level of measurement of a variable
(nominal, ordinal, ratio) determines which measure of central tendency or
variation can be used.
In tutorial we collected data on the kind of drugs people
used over the weekend. This was a
nominal level of measurement and therefore we could only identify a mode, or
most commonly occurring value. Next, we looked at data on 15 quiz scores. This was a ratio level measurement and
therefore we were able to identify a mode, median and mean.
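Here is a rough Python sketch of the same idea, using the standard `statistics` module. The responses and the 15 quiz scores below are invented stand-ins, not the data we actually collected in tutorial.

```python
from statistics import mean, median, mode

# Hypothetical nominal data: categories with no order, so only the mode applies.
drug_responses = ["caffeine", "alcohol", "caffeine", "none", "alcohol", "caffeine"]
print("most common response (mode):", mode(drug_responses))

# Hypothetical ratio data: 15 quiz scores, so mode, median, and mean all apply.
quiz_scores = [9, 5, 8, 10, 7, 9, 6, 8, 10, 9, 5, 7, 8, 9, 10]
print("mode:", mode(quiz_scores))      # most frequently occurring score
print("median:", median(quiz_scores))  # middle score once the scores are sorted
print("mean:", mean(quiz_scores))      # sum of the scores divided by how many there are
```

Notice that it would make no sense to ask for the mean or median of the drug responses: there is nothing to add up and no order to find a middle in.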
Measures of variation require an ordinal or ratio level
measure because the data must be ranked before we can describe how it varies
(remember nominal data cannot be ranked).
We’re only focusing on the range and standard deviation. The range
is the span from the smallest to the largest value in the data. For example, the range in age of 2030
students is 18-28 years old. By itself, this
doesn’t tell us much. We know the age of
the youngest and oldest students but we don’t know how the ages in between are distributed. For example, there could be nineteen 18-year-olds
and one 28-year-old, or there could be two students at each age from 18 through
28. Either way the range would be the
same.
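The two classes described above can be sketched in a few lines of Python. The class lists are built directly from that example, and `value_range` is just a hypothetical helper name; the median is added only to show that the two classes really are different even though their ranges match.

```python
from statistics import median

# Class A: nineteen 18-year-olds and one 28-year-old.
mostly_young = [18] * 19 + [28]
# Class B: two students at each age from 18 through 28.
evenly_spread = [age for age in range(18, 29) for _ in range(2)]

def value_range(values):
    """Return the (smallest, largest) values in the data."""
    return min(values), max(values)

for name, ages in [("mostly young", mostly_young), ("evenly spread", evenly_spread)]:
    print(name, "| range:", value_range(ages), "| median:", median(ages))
```

Both classes report the same range, yet the middle student is 18 in one class and 23 in the other, which is why the range alone is a fairly blunt measure of variation.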
The standard deviation
(SD) is more precise and tells us how far from the mean our data
vary. Because the mean can only be
calculated using ratio level data, SD also requires ratio
data. Remember that you will not need to
calculate SD. It is sufficient to simply
know that it tells us how varied our data are: the larger the SD, the more
variation in the data. When the data are roughly bell-shaped (normally distributed), about 2/3 (68%) of cases fall
within one SD of the mean, about 95% fall within two SDs, and about 99.7% fall within three SDs. For example,
suppose the mean annual income in a dataset is $50,000/year with an SD of $5,000. One SD is plus or minus $5,000 of the average
value. This means that approximately 2/3
(or 68%) of people earn between $45,000 and $55,000/year. Two SDs means that approximately 95% earn
between $40,000 and $60,000/year, and so forth.
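You won’t need to calculate SD, but seeing it computed can make the idea less abstract. The sketch below uses Python’s `statistics.pstdev` (the population SD). The two small datasets and the helper `sd_interval` are hypothetical, and the income figures are the ones from the example above.

```python
from statistics import mean, pstdev

# Two hypothetical datasets with the same mean (5) but different spread.
tight = [4, 5, 5, 5, 5, 5, 5, 6]
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]

for name, data in [("tight", tight), ("spread out", spread_out)]:
    # pstdev treats the list as the whole population, not a sample.
    print(name, "| mean:", mean(data), "| SD:", pstdev(data))

def sd_interval(centre, sd, k):
    """Return the (low, high) band lying within k SDs of the mean."""
    return centre - k * sd, centre + k * sd

# Income example from the text: mean $50,000, SD $5,000.
for k in (1, 2, 3):
    low, high = sd_interval(50_000, 5_000, k)
    print(f"within {k} SD: ${low:,} to ${high:,}")
```

The two datasets share a mean, but the second has the larger SD because its values sit further from that mean. The share of cases that falls inside each band (about two-thirds, about 95%, and so on) only holds when the data are roughly bell-shaped.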
Stay tuned for more!