Saturday, October 25, 2014

Chapter 11 Summary

As you know, Wednesday’s lecture was interrupted by a fire drill. Ugh, how inconvenient. We must, however, move forward, so below I’ve provided a conceptual synopsis of Chapter 11.

============

Up to now we’ve looked at methods of collecting quantitative (numeric) data.  Surveys, experiments, and content analysis are all methods that generate this data.  And we looked at different ways to create measures for this type of data, namely nominal, ordinal, and ratio levels.

Now we’ve come to Chapter 11.  This chapter is about making sense of the numeric data that we collect.  It teaches us how to describe data generally instead of on an individual basis.  We do this all the time in our lives.  For example, suppose you’re asked to describe how good of a student you are.  You wouldn’t list off every single grade you’ve ever received in school.  That would be ridiculous.  Instead, you’d give some kind of summary statement, like describing yourself as an “A” or “B” student, or by giving your overall GPA.  This idea of descriptive summary of data is the focus of Chapter 11.

First, think of data as a group of information about something.  If I wanted to estimate the age of 2030 students, then I’d ask a bunch of people, not just one.  And that bunch of people that I asked would provide me with a pool of information to work with, or a “dataset.”  Once we have data, then we have to make sense of it.

Second, groups of data can be analyzed individually or they can be compared to other groups.  Univariate analysis examines a single group of data that are represented as one variable.  For instance, age is a variable that we can analyze all by itself (e.g., the average age of students in Soci 2030).  Or, we can use bivariate analysis to compare age with another variable, like gender (e.g., the average age of women in 2030 compared to the average of men).  In this post, I’m going to deal only with univariate analysis.
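If it helps to see the difference in action, here is a minimal Python sketch.  The six student records are invented purely for illustration (they are not real class data), and the function name mean_age_by_group is just a label I made up.

```python
from statistics import mean

# Hypothetical (age, gender) records, invented for illustration only.
students = [
    (19, "woman"), (21, "man"),
    (20, "woman"), (22, "man"),
    (21, "woman"), (23, "man"),
]

# Univariate analysis: summarize one variable (age) all by itself.
overall_mean_age = mean(age for age, _ in students)

# Bivariate analysis: compare age across the categories of a second variable.
def mean_age_by_group(records):
    groups = {}
    for age, gender in records:
        groups.setdefault(gender, []).append(age)
    return {gender: mean(ages) for gender, ages in groups.items()}

print("Mean age overall:", overall_mean_age)
print("Mean age by gender:", mean_age_by_group(students))
```

The first number describes one variable; the second result only makes sense once a second variable is brought in to split the group.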

Third, we talk about data in two ways:  1) measures of central tendency and 2) measures of variation (or dispersion).  Central tendency is simply what data points “tend” to be, or their average.  For example, the average age of students in 2030 is 21.  However, that’s just a tendency; in fact, many students are not 21.  Measures of dispersion tell us how close most of the cases in our data are to the average.  For example, most students in 2030 are within 1 year plus or minus of 21.  This gives us two related pieces of information: we know that on average 2030 students are 21 years old and we know that most of those who are not 21 are still between 20 and 22.  In this example, most cases sit pretty close to the average.  But this is not always the case.

Here’s a different example.  Consider the age difference between students registered at York and students at Atkinson.  York is set up on the traditional university model and therefore students tend to be quite young.  York students are usually fresh out of high school, live at home or in residence, receive partial or full financial support from parents, and have limited professional work experience.  Not surprisingly, York students tend to be about 20 years old on average, give or take two years.  There’s not a lot of variation in age.

Now contrast York students with Atkinson students.  Atkinson was originally set up to accommodate mature students.  Mature students are older on average for a variety of reasons.  Maybe someone worked for a few years after high school and then decided to go to university.  Or, maybe they worked for 20 years, realized that they couldn’t get promoted further without more education, and are now attending night school to earn extra credits.  Or, maybe they worked their whole lives, retired, and decided to go to school and earn that degree they always dreamed of.  In short, Atkinson students tend to be older and are much more diverse in age.  The average age is 35, with most between 25-45 years.  Compared to York students, Atkinson students have a higher average age, plus there is also more variation from that average (plus or minus 10 years).

In short, central tendency gives us a general idea about what our data tend to be, while measures of dispersion help us understand how much data vary from that tendency.

Fourth, now that we have the basics, we need to identify the different measures of central tendency and variation.  Measures of central tendency are the mode, median, and mean.  Measures of variation are the range, percentiles, and standard deviation.  The level of measurement of a variable (nominal, ordinal, ratio) determines which measure of central tendency or variation can be used.

In tutorial we collected data on the kind of drugs people used over the weekend.  This was a nominal level of measurement and therefore we could only identify a mode, or most commonly occurring value. Next, we looked at data on 15 quiz scores.  This was a ratio level measurement and therefore we were able to identify a mode, median and mean.
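Here is a rough sketch of that idea using Python's built-in statistics module.  Both lists are made-up stand-ins (I am not reproducing the actual tutorial answers or quiz scores), so treat the numbers as assumptions.

```python
from statistics import mean, median, mode

# Hypothetical nominal data: categories with no order, so only the mode applies.
weekend_answers = ["caffeine", "alcohol", "caffeine", "none", "caffeine", "alcohol"]
most_common_answer = mode(weekend_answers)

# Hypothetical ratio data: 15 quiz scores out of 10.
quiz_scores = [5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10]
quiz_mode = mode(quiz_scores)      # most frequently occurring score
quiz_median = median(quiz_scores)  # middle score once the scores are ranked
quiz_mean = mean(quiz_scores)      # sum of the scores divided by how many there are

print("Mode of the nominal answers:", most_common_answer)
print("Quiz mode, median, mean:", quiz_mode, quiz_median, quiz_mean)
```

Asking for the median or mean of the nominal answers would not make sense, because categories like these cannot be ranked or added together.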

Measures of variation require an ordinal or ratio level measure because the data must be ranked before we can describe how they vary (remember, nominal data cannot be ranked).  We’re only focusing on the range and standard deviation.  The range is the span from the smallest to the largest value in the data.  For example, the range in age of 2030 students is 18-28 years old.  This doesn’t tell us much.  We know the ages of the youngest and oldest students but we don’t know how the ages in between are distributed.  For example, there could be nineteen 18-year-olds and one 28-year-old, or there could be 2 students at each age of 18, 19, 20…28.  Either way the range would be the same.
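The two scenarios in that example can be built in a few lines of Python to confirm that they share a range.  This is only a sketch: the list names and the age_range helper are my own labels, and the ages come from the example rather than from real enrolment data.

```python
# Scenario 1: nineteen 18-year-olds and one 28-year-old.
mostly_young = [18] * 19 + [28]

# Scenario 2: two students at every age from 18 through 28.
evenly_spread = [age for age in range(18, 29) for _ in range(2)]

def age_range(ages):
    """Return the smallest and largest value, which is all the range reports."""
    return min(ages), max(ages)

print("Scenario 1 range:", age_range(mostly_young))
print("Scenario 2 range:", age_range(evenly_spread))
```

Both groups report the same youngest and oldest age even though the ages in between are distributed very differently, which is why the range alone says so little.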

The standard deviation (SD) is more precise and tells us how far from the mean our data vary.  Because the mean can only be calculated using ratio level data, this implies that SD also requires ratio data.  Remember that you will not need to calculate SD.  It is sufficient to simply know that it tells us how varied our data are: the larger the SD, the more variation in the data.  When data are roughly bell-shaped (normally distributed), about 2/3 (roughly 68%) of cases fall within one SD of the mean, about 95% fall within two SDs, and about 99.7% fall within three SDs.  For example, suppose the mean annual income in a dataset is $50,000/year with an SD of $5,000.  One SD is plus or minus $5,000 of the average value.  This means that approximately 2/3 (or 68%) of people earn between $45,000 and $55,000/year.  Two SDs means that approximately 95% earn between $40,000 and $60,000/year, and so forth.
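You won't be asked to do this, but the income example can be checked with a short Python sketch.  It assumes incomes follow a normal (bell-shaped) curve, which is what the rule of thumb relies on.  The helper names sd_band and share_within are hypothetical labels of mine.

```python
from statistics import NormalDist

mean_income = 50_000  # mean annual income from the example
sd_income = 5_000     # standard deviation from the example

def sd_band(mean, sd, k):
    """Return the (low, high) interval that lies within k SDs of the mean."""
    return (mean - k * sd, mean + k * sd)

def share_within(mean, sd, k):
    """Share of a normal distribution that falls within k SDs of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    low, high = sd_band(mean, sd, k)
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high = sd_band(mean_income, sd_income, k)
    share = share_within(mean_income, sd_income, k)
    print(f"{k} SD: ${low:,} to ${high:,} covers about {share:.1%} of cases")
```

Under that normality assumption the three bands work out to roughly 68%, 95%, and 99.7% of cases.  Real income data are often skewed, so the percentages would only be approximate there.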


Stay tuned for more!