Question:
In Session 1 we discussed different types of data. Session 2 is titled "Calculating and Displaying Descriptive Statistics."

How do we describe and display categorical data? We are limited, but it can be done. We will use tables and graphs. The tables are either one-way or two-way frequency tables: one-way if there is only one categorical variable, two-way if there are two. All we do is count the frequency, or number of times, each categorical value occurs. We will use graphs such as the bar chart, pie chart, and Pareto chart to complement the tables. Graphs summarize or augment the tables, and a good visual display is often more insightful than the table alone.

Let's do some examples. The problem is Employee Education (link: employee education). The first thing to do is table the data.
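The counting step behind a one-way frequency table can be sketched in a few lines of Python. This is only an illustration: the `responses` list is made up and is not the course's employee education data, and `one_way_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical raw survey answers (illustrative only, not the course data).
responses = ["Bachelors", "No college", "Bachelors", "Masters",
             "No college", "Associates", "Bachelors", "PhD"]

def one_way_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = one_way_table(responses)

# Print the categories from most to least frequent, as a Pareto chart would order them.
for category, (count, percent) in sorted(table.items(), key=lambda kv: -kv[1][0]):
    print(f"{category:<12}{count:>4}{percent:>8.1f}%")
```

The percent column plays the same role as the third column built in the spreadsheet: each count divided by the grand total.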
It is important to note that the data as originally presented were already summarized. They were not raw data. Raw data were collected initially and then summarized; you can imagine what the survey would have asked each employee. Working with this table, we need to see how it can be made informative to the reader. Recall that these are categorical variables, not numerical variables. The frequencies are numbers, but the variables are not.

Use an Excel spreadsheet and construct a third column, percentage. Note the simple formula when you click on a cell in the percentage column. The percentages help us interpret the data. See if you can come up with some general conclusions about the education level of all the employees. Here is an example; I am sure you can do better. Most employees (about three quarters) have either a bachelor's degree (44%) or no college (32%). About 10% have a master's, 8% have an associate's, and nearly 6% have a PhD. About 85% have no degree or a degree other than a graduate degree. Only about 15% have an advanced degree. 32% have no degree at all.

The Excel sheet shows a bar chart and a pie chart. You should be able to get those using the "Insert" command in Excel. Highlight the data you want represented in the graph, click Insert, and then click either the picture of a bar chart or the picture of a pie chart.

Note that data on tenure are also collected for each employee and summarized in the table. Now there are two categorical variables: education level and time, or tenure, with the company. A table of two categorical variables is called a two-way table or contingency table. This is an important point: time with the company is numerical (years), but we group it and create categories. We do this to make the description simpler to understand. If tenure for an employee was anything less than 1 year, we put that employee into the first category. There were 164 employees with less than 1 year. If tenure was 1 year up to 5 years, we put those employees into the second category.
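Grouping a numerical variable into categories and then counting pairs of categories can be sketched as follows. The records are invented for illustration, `tenure_category` is a hypothetical helper, and placing exactly 5 years in the middle group is an assumption, since the text does not say which side that boundary belongs to.

```python
from collections import Counter

def tenure_category(years):
    """Map tenure in years to one of three categories.
    Assumption: exactly 5 years falls in the middle group."""
    if years < 1:
        return "Less than 1 year"
    if years <= 5:
        return "1 to 5 years"
    return "More than 5 years"

# Hypothetical (education, tenure in years) records, for illustration only.
employees = [("Bachelors", 0.5), ("Bachelors", 7.0), ("No college", 12.0),
             ("Masters", 3.0), ("PhD", 0.8), ("No college", 6.5)]

# Each cell of the two-way table counts one (education, tenure category) pair.
two_way = Counter((edu, tenure_category(yrs)) for edu, yrs in employees)

for (edu, tenure), count in sorted(two_way.items()):
    print(f"{edu:<12}{tenure:<20}{count}")
```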
We have created 3 tenure values for the categorical variable. The two-way table is shown in the Excel spreadsheet. For example, there were 63 employees who had a BA degree and had worked for the company more than 5 years. The table has a column at the far right labeled Total. It consists of the total of all cells in each row and is called a marginal distribution. At the bottom is a row labeled Total. It consists of the sum of the cells in each column and is also called a marginal distribution.

Here are the answers to the 4 questions. See if you get these: 32% or .32; .6% or .0059; .9%; 15.8%.

Calculate column percentages. These are shown on the spreadsheet and are important for interpreting the data. Note that each column adds up to 100%. You could also describe row percentages if you wanted to. More than 2/3 of those with no college degree have been with the company longer than 5 years, but almost none of the PhDs (less than 7%) have been there that long. It appears that over the past few years the company has hired proportionately more highly educated employees. Perhaps it changed its hiring policies, and it shows.

Now let's do a problem showing some descriptive analysis for raw categorical data. Here we construct a table from raw data, not summarized data. The GradSurvey data file is one I described earlier. Some of its variables are categorical and some are numerical. Suppose we wanted a one-way table for the variable gender. Gender is categorical. Suppose we wanted a two-way table for gender and employment status. Both are categorical. There is a discussion topic in Session 2 on the Grad Survey file example, along with some output. Read that and try to get the results using an Excel Pivot Table.

Next, we will discuss descriptive statistics for numerical values. But first let's develop a one-way table, bar chart, and pie chart for the variable Undergrad GPA. Look at the output.
(The output is in the link "mess".) What is the problem? It would be very difficult to interpret what this output means. The problem is that we are applying a simple one-way table to numerical data. Recall that with continuous data there are many different values, maybe only one of each. So when you table the data, the frequencies are very low numbers. Consequently the bar chart and pie chart are pretty but meaningless. We have to do something different: group the data and create a grouped frequency table. Then things will make sense.

In order to describe numerical data with descriptive statistics and displays, we need to examine the data. If we look at how the data are distributed, we may see patterns that are important. There are three major characteristics we should look for: Shape, Central Tendency, and Variation.

Shape means the pattern of the data, such as may be shown in a bar chart. We should look for things like symmetry, peaks, gaps, and outliers. Look at the link Session 2. In the first diagram, the pattern of data is symmetrical. In the second it is not. We call this skewness. Here the shape is skewed to the right: most values sit at the low end, and a long tail of a few high values stretches to the right. You could also have skewness to the left, where the long tail is made up of a few low values. In the third we have multiple peaks, or a bimodal shape. There could be two processes going on, which is something we may want to investigate further. In the fourth we have a gap. Something is happening with the data, and as with multiple peaks, two processes may be occurring. The fifth shows outliers. There are usually not enough of them to change the shape, but a few stragglers can throw off the data. They can cause trouble later when you prepare statistics such as averages, so they should be investigated. They may be correct, or they may be errors caused by misreading or incorrect recording.

There are a few ways of looking at the shape of your data.
We will call it the distribution of your data. There are the histogram, the box plot, and the stem and leaf diagram. I will discuss the histogram. You can read about it and the others in your readings for this session. We must first group the original numerical data into groups, cells, or bins. I use all three terms because you will see them all in your readings. The histogram that we can produce using Excel's Data Analysis tool uses "bins". We gather the data so that values falling between one number and another go into the same group or bin. Each newly created bin then has a higher count or frequency, and you can see what is happening more clearly than in the example I showed you earlier.

Procedure
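The binning idea can be sketched in Python as a grouped frequency table. The GPA values below are invented for illustration, `grouped_frequencies` is a hypothetical helper, and the convention that each bin includes its lower limit but not its upper limit is an assumption that may differ from how Excel treats bin boundaries.

```python
def grouped_frequencies(values, start, width, n_bins):
    """Count how many values fall in each bin [lower, lower + width).
    Values outside the overall range are ignored."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical undergraduate GPA values (illustrative only).
gpas = [2.1, 2.4, 2.75, 2.9, 3.0, 3.1, 3.15, 3.3, 3.45, 3.6, 3.8, 3.95]

# A text histogram: one '#' per observation in the bin.
for lower, upper, count in grouped_frequencies(gpas, start=2.0, width=0.5, n_bins=4):
    print(f"{lower:.1f} to under {upper:.1f}: {'#' * count} ({count})")
```

With twelve distinct values a plain one-way table would show twelve bars of height one; four bins give counts large enough to show a shape.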
Look at this example (link: utility). Try to follow all the tabs. What can we say about this? It is a symmetrical distribution and the central tendency is about 147. We will do more on central tendency later. I mentioned the box plot and the stem and leaf diagram earlier. Look at them in this link (utility more). Again we see a symmetrical distribution.

Next let's look at Central Tendency. If we are asked to describe data, one of the useful characteristics is its central value, or where most of the data are located. We will use the mean (or average), the median, and the mode. The formula for the mean is xbar = Σx / n, where we sum all the values of the variable (x) and divide the sum by n, the number of values. We could write a subscript on the summation sign but often leave it off for simplicity. Always be aware of how many values you have to sum.

The median is the halfway value. If you rank all your numerical values and draw a line at the halfway point, that is the median. Fifty percent of all values will be less than the median and 50 percent will be more. It is also called the 50th percentile or the second quartile.

The mode is the most frequently occurring value. Sometimes there is no mode because no value occurs more often than the others. The mode is not used a lot in practice, but it is a measure of central tendency.

If a distribution is symmetrical with a single peak, the mean, median, and mode are all about the same value. The mean may not be the best measure of central tendency for a skewed distribution, because it is more affected by extreme or outlier values than the median is. The median depends only on the middle of the ranked data, whereas the mean weights every value the same, small or large. Consider the values 4, 6, 5, 3, 20. Calculate the mean and median. The mean is 7.6 and the median is 5. Verify this using the formula above. The values are skewed to the right, or high side, so the mean is larger than the median.
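The mean and median of the five values can be checked with a short Python sketch. The second half, which drops the value 20 to show how much each measure moves, is an added illustration and not part of the original exercise.

```python
import statistics

values = [4, 6, 5, 3, 20]

mean = sum(values) / len(values)      # sum of x divided by n
median = statistics.median(values)    # middle value after ranking: 3, 4, 5, 6, 20

print(f"mean = {mean}, median = {median}")

# Dropping the extreme value 20 moves the mean much more than the median.
trimmed = [v for v in values if v != 20]
trimmed_mean = sum(trimmed) / len(trimmed)
trimmed_median = statistics.median(trimmed)
print(f"without 20: mean = {trimmed_mean}, median = {trimmed_median}")
```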
Remember that if the distribution is not skewed but is symmetric, the mean and median are about the same. So it is better to use the median as the measure of central tendency if the distribution is significantly skewed. When you calculate statistics you should always look at both and compare them.

Now let's look at Spread. This is a very important characteristic. It tells you how close together or far apart the individual values are. Is there little variation, in which case the histogram would be narrow, or is there a lot of variation, in which case the histogram would be wide? Look at the diagram in the link SESSION 2 above. You could have two distributions with the same measure of central tendency but quite different measures of spread. In statistics, variability of data plays a very important role. For a statistician, the less variability the better. That may not be true in other situations, but it is in statistics. Statisticians cannot control variability, but we deal with it.

So we need some measures of spread or variability. Look at the formula on the link SESSION 2. That formula gives you the variance of the data. If you take the square root of the variance you get the standard deviation. Both are used in statistics. Just remember that one is the square root of the other. Try the formula by hand on two data sets: #1 (2, 4, 6, 8) and #2 (2, 14, 26, 94). You should get variance (s^2) = 6.7 for #1 and 1696 for #2. Why is #2 so much larger? The formula will explain it. The standard deviation is s = 2.6 for #1 and 41.2 for #2.

Another measure of spread is the Range, which is simply the difference between the lowest and highest values. You can see now why a range cannot be calculated for a categorical variable but can for a numerical variable: you must be able to rank or order the values. Another is the Interquartile Range (IQR). It is calculated as Q3 - Q1, where Q3 is the third quartile (75% of values are lower than Q3 and 25% are higher) and Q1 is the first quartile (25% are lower and 75% are higher).
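The hand calculation can be checked with the sketch below. It assumes the linked formula is the sample variance, which divides by n - 1; that divisor is the one that reproduces the stated answers of 6.7 and 1696. The quartiles come from Python's `statistics.quantiles`, which uses one of several common quartile conventions, so the IQR may differ slightly from what Excel reports.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

def summarize(values):
    """Return variance, standard deviation, range, and IQR for a data set."""
    variance = sample_variance(values)
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # one quartile convention of several
    return {
        "variance": variance,
        "std_dev": math.sqrt(variance),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }

for data in ([2, 4, 6, 8], [2, 14, 26, 94]):
    stats = summarize(data)
    print(data, {name: round(value, 1) for name, value in stats.items()})
```

Data set #2 has a much larger variance because each deviation from the mean is squared, so the single value 94, which is 60 above the mean of 34, contributes 3600 to the sum on its own.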
The IQR therefore spans the middle 50% of the data. Remember that Q2 is the median.

Excel has a feature called the Data Analysis Tool that will give you all the descriptive statistics you would ever want. Click DATA at the top and then Data Analysis at the far right. Choose Descriptive Statistics, highlight your data, and indicate whether there are labels or not. Check Summary statistics as the output. Try it below. Look at the link savings rate and all its tabs. This is an example of descriptive statistics that could be used to explain the findings. What about shape, central tendency, and spread for these data? What stands out?

Coefficient of Variation: this is a statistic that is often used when comparing frequency distributions of numerical values. It combines the two most important statistics (mean and standard deviation). The formula is (standard deviation / mean) x 100. For example, if the mean of a distribution is 20 and the standard deviation is 5, the CV is 25%. If the mean is 20 and the standard deviation is 15, the CV is 75%. The latter distribution is much more variable and more spread out than the first. That can be an important factor. The Data Analysis tool does not include the coefficient of variation in its set of descriptives, but you can easily compute it. The relative magnitude of the percentage tells you a lot and makes it easier to compare different data sets.

Our final statistic is a very important one, and we will see it many times from this point on. Let's make sure we know what it is and why we need it. It is called a z score or z statistic. It is a standardized or normalized value that has no units or dimensions, and it is very seldom less than -3 or greater than +3. We can convert any numerical variable, no matter how large or small and no matter what the units are, into a standardized variable that has no units and usually falls between -3 and +3. A z value below -3 or above +3 is possible but rare, and it represents an extreme value.
So if we want to standardize a value x and convert it to z, the formula is z = (x - xbar) / s, where xbar is the mean of x and s is the standard deviation of x. Look at the formula on the link SESSION 2. By this we mean that x is one value from a distribution of numerical values that has a mean of xbar and a standard deviation of s. We know how to calculate those.

Suppose on the horizontal scale of a bell-shaped frequency distribution we mark off the mean at the center, then one, two, and three standard deviations above the mean. Then similarly mark off one, two, and three standard deviations below the mean. Look at the diagram on the link SESSION 2. These same markings are z values of 0 at the center, +1, +2, and +3 to the right of the center, and -1, -2, and -3 to the left of the center. To see why, consider what happens if the x value we want to standardize is the mean plus two standard deviations, so x = xbar + 2s. The z score is (xbar + 2s - xbar) / s. Thus z = 2. This can be done with any of the markings. Look at the diagram in the link Session 2.

To show how this works for any numerical value, calculate the z score for an x value of 75 feet, where we are measuring distance to a monument. Of all the distances we have measured, and there are several, the mean value (xbar) is 60 feet and the standard deviation (s) is 10 feet. So the z score is (75 ft - 60 ft) / 10 ft = 1.5. It is between -3 and +3 and has no units. Another example: suppose x = $120,000, the cost of a new piece of equipment. We have looked at several, and xbar is $140,000 and s is $10,000. Thus the z score for this particular piece is ($120,000 - $140,000) / $10,000 = -2. Again it has no units and is between -3 and +3. The values have been standardized. Later we will see why this matters, when we must calculate probabilities from a probability distribution.
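The two worked examples, and the xbar + 2s marking, can be reproduced with a one-line function. The name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, std_dev):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std_dev

# Distance to a monument: x = 75 ft, xbar = 60 ft, s = 10 ft.
distance_z = z_score(75, 60, 10)

# Equipment cost: x = $120,000, xbar = $140,000, s = $10,000.
cost_z = z_score(120_000, 140_000, 10_000)

# A value two standard deviations above the mean has z = 2, whatever the units.
marking_z = z_score(60 + 2 * 10, 60, 10)

print(distance_z, cost_z, marking_z)
```

The units cancel in the division, which is why feet and dollars end up on the same unitless scale.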
So, no matter how large or how small the data values are, they can be converted into z scores, which almost always fall between -3 and +3.

There is an important rule of thumb in statistics called the empirical rule. It states that about 68% (roughly two thirds) of all the values in a bell-shaped distribution are within one standard deviation of the mean, above or below. That also means within a z score of plus or minus 1. About 95% of all the values are within a z score of plus or minus 2, and about 99.7% (almost all) are within a z score of plus or minus 3.

Next we do a problem to test this rule. How close can we come? Look at this problem (link: Energy) and observe all the tabs. You can see that the data generally follow the empirical rule. We used standardized z scores in this problem.

One last problem shows that when you convert non-standardized data to standardized data, you can compare variables and calculate interesting facts. Otherwise, working with variables of different dimensions would be very difficult. See the links food and food consumption. If you look at the table of z scores, the last column is the most important. We can compare apples with apples because we have standardized our variables. Ireland has the highest sum of z scores. It is the largest consumer of meat and alcohol together.
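One way to see how closely bell-shaped data follow the empirical rule is to simulate some and count. This is a sketch on artificial data, not the Energy problem: the mean of 100, the standard deviation of 15, and the sample size are arbitrary assumptions, and `share_within` is a hypothetical helper.

```python
import random
import statistics

random.seed(650)  # fixed seed so the run is repeatable

# Artificial bell-shaped data (assumed mean 100, standard deviation 15).
data = [random.gauss(100, 15) for _ in range(100_000)]

mean = statistics.fmean(data)
std_dev = statistics.stdev(data)

def share_within(k):
    """Fraction of values whose z score lies between -k and +k."""
    return sum(abs((x - mean) / std_dev) <= k for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

Real data sets are smaller and less perfectly bell-shaped, so their percentages will usually land near, not exactly on, 68%, 95%, and 99.7%.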
COURSE: MGMT 650
Please include references. Thanks
Note: I did not see where it says "Creat a" or "Write a".
