{"id":3295,"date":"2016-11-01T04:03:11","date_gmt":"2016-11-01T04:03:11","guid":{"rendered":"http:\/\/course.oeru.org\/research-methods\/?page_id=3295"},"modified":"2016-11-01T04:03:11","modified_gmt":"2016-11-01T04:03:11","slug":"descriptive-statistics","status":"publish","type":"page","link":"https:\/\/course.oeru.org\/research-methods\/modules-4-6\/module-4-quantitative-methods\/descriptive-statistics\/","title":{"rendered":"Descriptive Statistics"},"content":{"rendered":"<div id=\"content\" class=\"mw-body container\" role=\"main\">\n<div class=\"row\">\n<div class=\"col-md-12\">\n<div class=\"panel\">\n<div class=\"panel-body\">\n<div id=\"bodyContent\">\n<div id=\"mw-content-text\" lang=\"en\" dir=\"ltr\" class=\"mw-content-ltr\">\n<p>Before we begin to analyse statistical data, we need to get comfortable with it.  Our first step is simply to describe the distribution of one variable at a time.  This is called <i>univariate analysis<\/i>.\n<\/p>\n<h1><span class=\"mw-headline\" id=\"Central_Tendency\">Central Tendency<\/span><\/h1>\n<p>The central tendency of a variable is nothing more than its average value.  There are, however, different kinds of averages: the mean, the median, and the mode.\n<\/p>\n<h1><span class=\"mw-headline\" id=\"The_Mean\">The Mean<\/span><\/h1>\n<p>Also known as the \u201carithmetic mean,\u201d this is probably what most of us think of when we use the term \u201caverage.\u201d  The mean value of a variable is calculated simply by adding up all the values and then dividing by the number of cases.  The mean requires that data be at least interval <sup id=\"cite_ref-1\" class=\"reference\"><a href=\"#cite_note-1\">[1]<\/a><\/sup> level.  Take, for example, a variable with three cases having the values 2, 3, and 4 respectively.  
The mean for this variable would obviously be 3 (2+3+4=9; 9\/3=3), but this makes no sense unless the difference between 2 and 3 is the same as the difference between 3 and 4.\n<\/p>\n<p>In mathematical notation, the mean for population data is represented by the symbol \u03bc (the lower case Greek letter mu). If we are using sample data, the mean (of a variable named X) is represented by the symbol <span class=\"mathjax-wrapper\">\\( \\bar{X} \\)<\/span> (pronounced \u201cX bar\u201d).\n<\/p>\n<h1><span class=\"mw-headline\" id=\"The_Median\">The Median<\/span><\/h1>\n<p>The median is calculated by ranking cases from high to low (or vice-versa) and then finding the value of the case that is in the middle (also called the 50th percentile) of the distribution. By definition, half of all cases are at or above the median, and half below.  In a distribution of 21 cases, for example, the median value is the value of the 11th highest case, since there are 10 cases with higher values and 10 cases with lower values.  If there is an even number of cases, the median value is the value half way between the values of the two cases closest to the middle.  For example, in a distribution of 20 cases, the median value is half way between the values of the 10th and 11th highest cases.\n<\/p>\n<p>The notion of a \u201cmiddle\u201d case makes sense only if cases can be rank-ordered. Calculation of a median, therefore, requires at least <a href=\"\/ResearchMethods\/QuantMix\/Measurements#Ordinal_Data\" title=\"ResearchMethods\/QuantMix\/Measurements\">ordinal<\/a> level data.  Sometimes, it makes sense to calculate a median instead of, or in addition to, a mean even with interval or ratio data.  If the distribution of the values of a variable is heavily \u201cskewed\u201d by a few very high or very low scores, the mean of the distribution will be misleading.  
Suppose, for example, that there are 100 households in your neighborhood, and that both the mean and the median household income are about $50,000 per year.  Now suppose that Bill Gates and his family move in next door.  The median household income will not change much (now that the neighborhood contains 101 families, it will be the income of the family ranked 51st), but the mean household income will be in the hundreds of millions of dollars.  Which figure better describes the \u201caverage\u201d family in the neighborhood?\n<\/p>\n<h1><span class=\"mw-headline\" id=\"The_Mode\">The Mode<\/span><\/h1>\n<p>The mode of a variable is the value that occurs most frequently.  In Australia, the modal ancestry is English (36.1%) and the modal gender is female (100 females for every 99.1 males)<a rel=\"nofollow\" class=\"external autonumber\" href=\"http:\/\/www.abs.gov.au\/ausstats\/abs@.nsf\/mf\/3235.0\">[1]<\/a>.  In an election, the modal candidate is the one who receives more votes than anyone else.  In politics, a mode is often referred to as a &#8220;plurality.&#8221;  The mode can be used with any level of measurement.\n<\/p>\n<p>Sometimes the choice of which measure of central tendency to use can be a hot political topic.  For example, a measure of economic prosperity based on the mean wage can be heavily skewed by the high wages of executives, whereas the modal wage provides a more realistic picture of the situation of the working class.\n<\/p>\n<h1><span class=\"mw-headline\" id=\"Dispersion\">Dispersion<\/span><\/h1>\n<p>In addition to the average value of a variable, we also want to know how spread out the values are: their <b>dispersion<\/b>.  The range (the difference between the maximum and minimum values) gives an indication, but only a very limited one.  
There are some other, more useful, measures.\n<\/p>\n<h2><span class=\"mw-headline\" id=\"The_Variance_and_the_Standard_Deviation\">The Variance and the Standard Deviation<\/span><\/h2>\n<p>The <b>variance<\/b> and the <b>standard deviation<\/b> are related measures of how spread out the values of a variable are from the mean.  Just as the mean requires at least interval level measurement, so do the variance and the standard deviation.\n<\/p>\n<p>Let&#8217;s look at an example.  Compare the two sets of numbers shown below.  Both have the same mean (10), but the numbers on the right (Set 2) are clearly more spread out than those on the left (Set 1).\n<\/p>\n<table class=\"oeru1 table table-striped\">\n<tbody>\n<tr>\n<th colspan=\"2\" style=\"text-align: center;\"> Table 1: Example of variance difference\n<\/th>\n<\/tr>\n<tr>\n<th>Set 1\n<\/th>\n<th>Set 2\n<\/th>\n<\/tr>\n<tr>\n<td> 12\n<\/td>\n<td> 14\n<\/td>\n<\/tr>\n<tr>\n<td> 11\n<\/td>\n<td> 12\n<\/td>\n<\/tr>\n<tr>\n<td> 10\n<\/td>\n<td> 10\n<\/td>\n<\/tr>\n<tr>\n<td> 9\n<\/td>\n<td> 8\n<\/td>\n<\/tr>\n<tr>\n<td> 8\n<\/td>\n<td> 6\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Table 1 shows two sets of numbers with the same mean but different variances.  Table 2 shows how the variance of the group of numbers on the left (Set 1) is calculated.  In the first column, the individual values of the variable (which we will represent with the symbol \u201cX<sub>i<\/sub>\u201d) are listed.  In the second column, the mean value of 10 (here we&#8217;ll use the symbol for the population mean, \u03bc) is subtracted from each value to give its \u201cdeviation\u201d from the mean.  If we simply took an average of the deviations, the result would always be zero, because the positive and negative deviations cancel out.  Instead, in the third column we square the deviations from the mean.  
Finally, we sum (<span class=\"mathjax-wrapper\">\\(\\sum\\)<\/span>, the upper-case Greek letter sigma) these squared deviations from the first through the last, or N<sup>th<\/sup>, case <span class=\"mathjax-wrapper\">\\( (\\sum_{i=1}^N ) \\)<\/span> and divide by the number of cases (N = 5).  The result is the \u201cmean squared deviation from the mean,\u201d or the variance.  The variance <sup id=\"cite_ref-2\" class=\"reference\"><a href=\"#cite_note-2\">[2]<\/a><\/sup> is represented by the symbol \u03c3<sup>2<\/sup> (the square of the lower-case Greek letter sigma) for population data, and by s<sup>2<\/sup> for sample data.\n<\/p>\n<table class=\"oeru1 table table-striped\">\n<tbody>\n<tr>\n<th colspan=\"3\" style=\"text-align: center;\"> Table 2: Calculating variance\n<\/th>\n<\/tr>\n<tr>\n<th>X<sub>i<\/sub>\n<\/th>\n<th>X<sub>i<\/sub> &#8211; <span class=\"mathjax-wrapper\">\\( \u03bc \\)<\/span>\n<\/th>\n<th>(X<sub>i<\/sub> &#8211; <span class=\"mathjax-wrapper\">\\( \u03bc \\)<\/span>)<sup>2<\/sup>\n<\/th>\n<\/tr>\n<tr>\n<td> 12\n<\/td>\n<td> 2\n<\/td>\n<td> 4\n<\/td>\n<\/tr>\n<tr>\n<td> 11\n<\/td>\n<td> 1\n<\/td>\n<td> 1\n<\/td>\n<\/tr>\n<tr>\n<td> 10\n<\/td>\n<td> 0\n<\/td>\n<td> 0\n<\/td>\n<\/tr>\n<tr>\n<td> 9\n<\/td>\n<td> -1\n<\/td>\n<td> 1\n<\/td>\n<\/tr>\n<tr>\n<td> 8\n<\/td>\n<td> -2\n<\/td>\n<td> 4\n<\/td>\n<\/tr>\n<tr>\n<td colspan=\"3\"> <span class=\"mathjax-wrapper\">\\( \\sigma^2 = \\dfrac{\\sum_{i=1}^N (X_{i} - \\mu)^2}{N} = 10\/5 = 2 \\)<\/span>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The standard deviation (\u03c3 for population data, s for sample data), like the variance, is a measure of dispersion, and is the one usually reported.  It is simply the positive square root of the variance.  In the above example, <span class=\"mathjax-wrapper\">\\( \\sigma = \\sqrt{2} \\approx 1.41 \\)<\/span>.\n<\/p>\n<p>The variance and the standard deviation are usually not of great interest in and of themselves.  
They are, however, central to a wide variety of other statistical methods.  Occasionally, they do have direct application.  Beck, for example, demonstrates the nationalization of American politics during the twentieth century by showing that the standard deviation in presidential vote by state declined fairly steadily between 1896 and 1992. <sup id=\"cite_ref-3\" class=\"reference\"><a href=\"#cite_note-3\">[3]<\/a><\/sup>\n<\/p>\n<p>If you want to learn more about the standard deviation, check out <a rel=\"nofollow\" class=\"external text\" href=\"https:\/\/www.khanacademy.org\/math\/probability\/descriptive-statistics\/variance-std-deviation\/v\/statistics-standard-deviation\">this video<\/a> from Khan Academy.\n<\/p>\n<h1><span class=\"mw-headline\" id=\"Boxplots\">Boxplots<\/span><\/h1>\n<p>A boxplot (also known as a box and whiskers plot) is another way of examining the distribution of a continuous variable.<br \/>\nFigure A shows a boxplot of educational expenditures as a percentage of the Gross Domestic Product (GDP) of various countries.\n<\/p>\n<div class=\"thumb tright\">\n<div class=\"thumbinner thumbnail\" style=\"width:182px;\"><a href=\"http:\/\/WikiEducator.org\/File:FigureA-boxplots1.gif\" class=\"image\"><img loading=\"lazy\" decoding=\"async\" alt=\"\" src=\"\/\/WikiEducator.org\/images\/thumb\/0\/08\/FigureA-boxplots1.gif\/180px-FigureA-boxplots1.gif\" width=\"180\" height=\"144\" class=\"thumbimage img-responsive\"><\/a>  <\/p>\n<div class=\"thumbcaption\">Boxplot illustration for research methods subject<\/div>\n<\/div>\n<\/div>\n<p>The \u201cbox\u201d in the figure shows the <i><b>interquartile range<\/b><\/i>.  That is, the line at the top of the box represents the value of the 75<sup>th<\/sup> percentile, while the line at the bottom of the box represents the value of the 25<sup>th<\/sup> percentile.  In other words, the middle half of all countries are within the box.  
The value of the 50<sup>th<\/sup> percentile (that is, of the median value) is represented by the horizontal line within the box.  The lines extending from the box are the \u201cwhiskers,\u201d and the horizontal lines at the end of the whiskers represent the highest and lowest values that are outside the box but within 1.5 times the inter-quartile range (1.5*IQR).    The circles beyond the whiskers represent \u201coutliers,\u201d that is, cases outside the box by more than 1.5*IQR, while asterisks represent \u201cextreme outliers,\u201d that is, those outside the box by more than 3*IQR. We&#8217;ll take up this subject again in the next chapter when we discuss the normal distribution. Note for now that there are several outliers and two extreme outliers (Timor-Leste and Cuba).\n<\/p>\n<p>Figure B shows the distribution of the same variable, but this time broken down by region.  Here we can see that, as a percent of GDP, educational expenditures don&#8217;t vary much by region. Within most regions, however, there are outliers or extreme outliers, that is, countries that spend a much larger or smaller share of their GDP on education than do other countries in the same region.\n<\/p>\n<p>Download the image: \/\/wikieducator.org\/images\/0\/0f\/FigureB-boxplots2.gif\n<\/p>\n<div class=\"thumb tright\">\n<div class=\"thumbinner thumbnail\" style=\"width:182px;\"><a href=\"http:\/\/WikiEducator.org\/File:FigureB-boxplots2.gif\" class=\"image\"><img loading=\"lazy\" decoding=\"async\" alt=\"\" src=\"\/\/WikiEducator.org\/images\/thumb\/0\/0f\/FigureB-boxplots2.gif\/180px-FigureB-boxplots2.gif\" width=\"180\" height=\"143\" class=\"thumbimage img-responsive\"><\/a>  <\/p>\n<div class=\"thumbcaption\">Boxplot illustration for research methods subject<\/div>\n<\/div>\n<\/div>\n<p><br style=\"clear:both;\">\n<\/p>\n<h1><span class=\"mw-headline\" id=\"Key_points\">Key points<\/span><\/h1>\n<div class=\"panel iDevice\">\n\t<div class=\"panel-heading 
idevice-heading\">\n\t\t<div>\n\t\t\t<img decoding=\"async\" class=\"pedagogicalicon\" alt=\"key points\" src=\"https:\/\/course.oeru.org\/research-methods\/wp-content\/themes\/oeru_course\/idevices\/Icon_key_points.png\">\n\t\t<\/div>\n\t\t<div>\n\t\t\t<h2>Descriptive statistics: key terms<\/h2>\n\t\t<\/div>\n\t<\/div>\n\t<div class=\"panel-body\">\n\t\t<div class=\"col-md-12\">\n\t\t\t<\/p>\n<ul>\n<li> <i><b>Boxplot or box and whiskers plot<\/b><\/i> &#8211; A chart for examining the distribution of a continuous variable.\n<\/li>\n<li> <i><b>Dispersion<\/b><\/i> &#8211; How spread out the values of a variable are.\n<\/li>\n<li> <i><b>Interquartile range<\/b><\/i> &#8211; The range between the 25<sup>th<\/sup> and the 75<sup>th<\/sup> percentiles of a continuous variable.\n<\/li>\n<li> <i><b>Mean<\/b><\/i> &#8211; Also called &#8216;average&#8217;.  Calculated simply by adding up all the values and then dividing by the number of cases.\n<\/li>\n<li> <i><b>Median<\/b><\/i> &#8211; Calculated by ranking cases from high to low (or vice-versa) and then finding the value of the case that is in the middle (also called the 50<sup>th<\/sup> percentile) of the distribution.\n<\/li>\n<li> <i><b>Mode<\/b><\/i> &#8211; The value that occurs most frequently.\n<\/li>\n<li> <i><b>Standard deviation<\/b><\/i> &#8211; The positive square root of the variance; the measure of dispersion that is usually reported.\n<\/li>\n<li> <i><b>Variance<\/b><\/i> &#8211; The mean squared deviation from the mean; a measure of how spread out the values of a variable are from the mean.\n<\/li>\n<\/ul>\n<p>\n<\/p>\n<p>\n\t\t<\/div>\n\t<\/div>\n<\/div>\n<p>\n<\/p>\n<div class=\"panel iDevice\">\n\t<div class=\"panel-heading idevice-heading\">\n\t\t<div>\n\t\t\t<img decoding=\"async\" class=\"pedagogicalicon\" alt=\"activity\" src=\"https:\/\/course.oeru.org\/research-methods\/wp-content\/themes\/oeru_course\/idevices\/Icon_activity.png\">\n\t\t<\/div>\n\t\t<div>\n\t\t\t<h2>Activities: find descriptive statistics<\/h2>\n\t\t<\/div>\n\t<\/div>\n\t<div 
class=\"panel-body\">\n\t\t<div class=\"col-md-12\">\n\t\t\t<\/p>\n<ol>\n<li> Find examples of each of the descriptive statistics described above reported in academic journal articles.\n<\/li>\n<li> Upload a screenshot of one of these examples to the discussion forum.\n<\/li>\n<li> Describe what point it was used to prove\/illustrate\/show.\n<\/li>\n<\/ol>\n<p>\n<\/p>\n<p>\n\t\t<\/div>\n\t<\/div>\n<\/div>\n<p>\n<\/p>\n<h1><span class=\"mw-headline\" id=\"Notes\">Notes<\/span><\/h1>\n<ol class=\"references\">\n<li id=\"cite_note-1\"><span class=\"mw-cite-backlink\"><a href=\"#cite_ref-1\">\u2191<\/a><\/span> <span class=\"reference-text\"> Sometimes, in addition to the values being ordered, the differences (or intervals) between any two adjacent values on a measurement scale are the same.  For example, the difference in temperature between 80 degrees Fahrenheit and 81 degrees is the same as that between 90 degrees and 91 degrees.  When each interval represents the same increment of the thing being measured, the measure is called an interval variable. <\/span>\n<\/li>\n<li id=\"cite_note-2\"><span class=\"mw-cite-backlink\"><a href=\"#cite_ref-2\">\u2191<\/a><\/span> <span class=\"reference-text\"> The formula for the population variance is <span class=\"mathjax-wrapper\">\\( \\sigma^2 = \\dfrac{\\sum_{i=1}^N (X_i - \\mu)^2}{N} \\)<\/span><\/span>\n<\/li>\n<li id=\"cite_note-3\"><span class=\"mw-cite-backlink\"><a href=\"#cite_ref-3\">\u2191<\/a><\/span> <span class=\"reference-text\"> Richard M. Scammon and Ben J. Wattenberg, <i>The Real Majority<\/i> (N.Y.: Coward-McCann, 1970), 70. <\/span>\n<\/li>\n<\/ol>\n<h1><span class=\"mw-headline\" id=\"Credits\">Credits<\/span><\/h1>\n<p>John L. 
Korey 2013, POLITICAL SCIENCE AS A SOCIAL SCIENCE, Introduction to Research Methods in Political Science:<br \/>\nThe POWERMUTT* Project, <a rel=\"nofollow\" class=\"external autonumber\" href=\"http:\/\/www.cpp.edu\/~jlkorey\/POWERMUTT\/Topics\/political_science_as_a_social_science.html#note1\">[2]<\/a>\n<\/p>\n<p>\n<\/div>\n<div class=\"visualClear\"><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"row\">\n<div class=\"col-md-12\">\n<ul class=\"pager\">\n<li class=\"previous\">\n            <a href=\"\/research-methods\/modules-4-6\/module-4-quantitative-methods\/statistics-how-to-measure\">\u2190 Previous<\/a>\n          <\/li>\n<li class=\"next\">\n            <a href=\"\/research-methods\/modules-4-6\/module-4-quantitative-methods\/statistics-presenting-categorical-data\">Next \u2192<\/a>\n          <\/li>\n<\/ul><\/div>\n<\/p><\/div>\n<\/div>\n<footer>\n<br \/>\n<\/footer>\n","protected":false},"excerpt":{"rendered":"<p>Before we begin to analyse statistical data, we need to get comfortable with it. So at first, simply to describe the distribution of one variable at a time. This is also called univariate analysis. Central Tendency The central tendency of a variable is nothing more than its average value. 
There are, however, different kinds of [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":3285,"menu_order":3900,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-3295","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/pages\/3295","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/comments?post=3295"}],"version-history":[{"count":1,"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/pages\/3295\/revisions"}],"predecessor-version":[{"id":3296,"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/pages\/3295\/revisions\/3296"}],"up":[{"embeddable":true,"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/pages\/3285"}],"wp:attachment":[{"href":"https:\/\/course.oeru.org\/research-methods\/wp-json\/wp\/v2\/media?parent=3295"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}