Week 1 Lecture 1
Class Approach to Statistics
Statistics is basically a set of tools that allow us to get information out of data sets (we will get to the more formal definition below). As such, it can be taught as a math class (focusing on formulas), a logic class (if this, then that), or as a case study (here is the problem, what are we going to do?). We have chosen the latter - we will be examining statistical tools and approaches as they help us answer a business question. The question we will focus on involves the Equal Pay Act, specifically the requirement that males and females be paid the same if they are performing equal or equivalent work. So, our business research question is: are males and females paid the same for equal work?

In starting out with our case, we will have a data set that provides a number of variables (measures that can assume different values with different subjects) for each of 50 employees selected randomly from our company. (The company and employee data are fictitious, of course.) For each employee (labeled 1 through 50 in the ID column), we will have:

Salary - the annual salary, recorded in thousands of dollars and rounded to the nearest hundred dollars; for example, a salary of $32,650 would be recorded as 32.7.
Compa (short for compa-ratio, or comparative ratio) - a measure of how a salary relates to the midpoint of a pay range, found by dividing the salary by the pay range midpoint.
Midpoint - the middle of the salary range assigned to each grade.
Age - the employee's age (rounded to the nearest birthday).
Performance Rating - a value between 1 and 100 showing the manager's rating of how well the employee performs their job.
Service - the years the employee has been with the company (rounded to the nearest hiring anniversary).
Gender - a numerical code indicating the employee's gender (1 = female, 0 = male).
Raise - the percent increase in pay of the last performance-based increase in salary.
Degree - the educational achievement of the employee (0 = BA/BS, 1 = Master's or more).
Gender1 - a letter code indicating the employee's gender (F = female, M = male).
Grade - the employee's pay level; grade A is the lowest (entry level) and grade E is the highest.

During each week, we will examine some of these variables to see if they help us answer the question of males and females receiving equal pay for equal work. In the weekly lectures, we will work with the variable Salary. In the homework assignments for weeks 2, 3, and 4, you will have the same questions but work with the variable Compa, which - by definition - is an alternate way of looking at pay. If you have any questions about this description of our course case, please ask them in either Ask Your Instructor or in one of the class posts.

Introduction to Statistics
Formally, we can define statistics as "the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions" (Lind, Marchal, & Wathen, 2008, p. 4). This makes statistics and statistical analysis a subset of both critical thinking and quantitative thinking, both skills that Ashford University has identified as critical abilities for any student graduating with a degree. H. G. Wells, the author, once said that "one day quantitative reasoning will be as necessary for effective citizenship as the ability to read." In this class, we will focus mostly on analyzing and interpreting data that we will assume has been correctly collected, so that we can use it to make decisions.
In doing this, there is a fairly well agreed upon approach to understanding what the data are trying to tell us. This approach will be followed in this class, and involves:
1. Identifying what kinds of data we are working with, then
2. Developing summary statistics for the data,
3. Developing appropriate statistical tests to make decisions about the population the data came from, and
4. Drawing conclusions from the test results to answer the initial research question(s).

Data Characteristics
We all recognize that not all data are the same. Saying we "like" something is quite a bit different than saying the part weighs 3.7 ounces. We treat these two kinds of data in very different ways. The first distinction we make in data types involves identifying our data as either qualitative or quantitative. Qualitative data identify characteristics or attributes of something being studied. They are non-numeric and can often be used for grouping purposes. Some examples include nationality, gender, type of car, etc. Quantitative data, on the other hand, tend to measure how much of what is being examined exists. Examples of these kinds of variables include money, temperature, number of drawers in a desk, etc.

Within quantitative data, we can identify continuous and discrete data types. Continuous data variables can assume any value within limits. For example, depending upon how accurate our measuring instrument is, the temperature, in degrees Fahrenheit, could be 75, 75.3, 75.32, 75.3287468, and so on. There are no natural "breaks" in temperature even though we typically only report it in whole numbers and ignore the decimal portion. Height would be another continuous data variable. Discrete data, on the other hand, can take only certain values, with breaks between those values. The number of drawers in a desk could be 3 or 4, but not 3.56, for example.

The second important approach in defining data is the "level" of the data. There exist four distinct levels:
Nominal - these serve as names or labels, and could be considered qualitative. The basic use for this level is to identify distinctions between and among subjects, such as ID numbers, gender identification (Male or Female), or car type (Ford, Nissan, etc.). We can basically only count how many exist within each group of a nominal data variable.
Ordinal - these data have the same characteristics as nominal with the addition of being rankable - that is, we can place them in a descending or ascending order. One example is rating something using good, better, best (even if coded 1 = good, 2 = better, and 3 = best). We can rank this preference, but cannot say the difference between each data point is the same for everyone.
Interval - this level of data adds the element of constant differences between sequential data points. While we do not know the difference between good and better, or better and best, we do know the difference between 57 degrees and 58 degrees - and it is the same as the difference between 67 and 68 degrees.
Ratio - this level adds a "meaningful" 0, which means the absence of the characteristic. Temperature (at least for the Celsius and Fahrenheit scales) does not have a 0 point meaning no heat at all. A scale with a meaningful 0, such as length, has equal ratios - the ratio of 4 feet to 2 feet has the same value as that of 8 feet to 4 feet; both are 2. This cannot be said of temperatures, for example (Tanner & Youssef-Morgan, 2013). These levels are often recalled by the acronym NOIR.
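To make the four levels concrete, here is a minimal Python sketch (the variables and sample values are invented for illustration, not taken from our course data) showing the operations each level supports: counting for nominal, ranking for ordinal, differences for interval, and meaningful ratios for ratio data.

```python
from collections import Counter

# Nominal: labels only - counting group membership is the valid operation.
car_types = ["Ford", "Nissan", "Ford", "Toyota", "Ford"]
print(Counter(car_types))              # Counter({'Ford': 3, 'Nissan': 1, 'Toyota': 1})

# Ordinal: labels that can be ranked, but the gaps between ranks are not equal.
ratings = ["better", "good", "best", "good"]
order = {"good": 1, "better": 2, "best": 3}
print(sorted(ratings, key=order.get))  # ['good', 'good', 'better', 'best']

# Interval: equal differences are meaningful, but there is no true zero,
# so ratios are not (40 F is not "twice as hot" as 20 F).
temps_f = [57, 58, 67, 68]
print(temps_f[1] - temps_f[0] == temps_f[3] - temps_f[2])  # True - same 1-degree gap

# Ratio: a meaningful zero makes ratios valid.
lengths_ft = [2, 4, 8]
print(lengths_ft[1] / lengths_ft[0], lengths_ft[2] / lengths_ft[1])  # 2.0 2.0
```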
Knowing what kinds of data we have is important, as it identifies what kinds of statistical analysis we can do.

Equal Pay Question
At the end of each lecture, we will apply the topics discussed to our research question of whether males and females receive equal pay for equal work. In this section, we will look at identifying the data characteristics for each of our data variables. Looking at our first classification, qualitative versus quantitative, we have:

Qualitative: ID, Gender, Gender1, Degree, Grade
Quantitative (continuous): Compa, Age, Performance Rating
Quantitative (discrete): Salary, Midpoint, Raise, Service

Most of these are fairly clear - the variables in the qualitative group merely identify different groups. The continuous variables can all - theoretically - be carried out to many decimal points, while those in the discrete list all have distinct values within their range of available values. The identification for the NOIR classification is shown below:

Nominal: ID, Gender, Gender1, Degree
Ordinal: Grade
Interval: Performance Rating
Ratio: Salary, Midpoint, Service, Compa, Age, Raise

While an argument can be made that Performance Ratings, being basically opinions, are really ordinal data, for this class let us assume that they are interval level, as many organizations treat them as such. An important reason for always knowing the data level for each variable is that we are limited in what can be done with different levels. With nominal scales, we can only count how many fall into each group. With ordinal scales, we can do some limited analysis of differences using certain tests that are not covered in this course. Both interval and ratio scales allow us to do both inferential and descriptive analysis (Tanner & Youssef-Morgan, 2013). Most of the statistical tools we will cover in this class require data scales that are at least interval in nature. During our last two weeks, we will look at some techniques for nominal and ordinal data measures. In Lecture 2, we will start to see what kinds of things we can do with each level of the NOIR characteristics. If you have any questions about this material, please ask them in either Ask Your Instructor or in the discussion area.

References
Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.

Week 1 Lecture 2
In Lecture 1, we focused on identifying the characteristics - quantitative, qualitative, discrete, continuous, NOIR - of the data. In this section, we will take a look at how we can summarize a data set with descriptive statistics, and how we can ensure that these descriptive statistics can be used as inferential statistics to make inferences and judgments about a larger population. We are moving into the second step of the analysis approach mentioned in Lecture 1.

Descriptive Statistics
Once we understand the kinds of data we have, the natural reaction is to want to summarize it - reduce what may be a lot of data into a few measures to make sense of what we have. We start with summary descriptions; the principal types focus on location, variability, and likelihood. (Note: we will deal with likelihood, AKA probability, in Lecture 3 for this week.) For nominal data, our analysis is limited to counting how many exist in each group, such as how many cars by car company (Ford, Nissan, etc.) are in the company parking lot.
However, we can also use nominal data as a group name to form different groups to examine; in this case we do nothing with the actual data label, but do some analysis with the data in each group. An example related to our class case: we can group the salary data values into two groups using the nominal variable Gender (or Gender1). With ordinal scales, we can do some limited analysis of differences using certain tests, most of which are not covered in this course. We can also use ordinal data as grouping labels; for example, we could do some analysis of salary by educational degree. Both interval and ratio scales allow us to do both inferential and descriptive analysis (Tanner & Youssef-Morgan, 2013). Most of the statistical tools we will cover in this class require data scales that are at least interval in nature.

Location measures. When working with interval or ratio level variables, the first measures most researchers look at are indications of location - the mean, median, and mode. The mean is the numerical average of the data - simply add the values and divide by the total count. The median is the middle of the data set: rank order the values from low to high (or high to low), and pick the value that is in the middle. This is easy if we have an odd number of values, as we can find the middle exactly. If we have an even number of values, the median is the average of the middle two values. For example, in this data set: 2, 3, 4, 5, 6, we have five values and the median is 4. However, in this data set: 2, 3, 4, 5, we have only four values and the median is the average of the middle two numbers = (3 + 4)/2 = 7/2 = 3.5. Finally, the mode is the most frequently occurring value; as such, it may or may not exist, and there may be more than one mode in any data set. Generally, the mean is the most useful measure for a data set, as it contains information regarding all the values, and it is the location measure used in many statistical tests. The symbol for the mean of a population is μ (called mu), while we use x̄ (sometimes typed as xbar) for the sample mean.

Variation measures. After finding our mean (or other center measure), we generally want to know how consistent the data are - that is, are the data bunched around the center, or spread out? The more spread out a data set is, the less accurately any single measure describes all of the data. Looking at the consistency (or lack of consistency) in a data set will often give us a different understanding of what is going on. A simple example: if we have two departments in a company that each averaged 3.0 on a question in a company morale survey, we might be tempted to say they were the same. However, if we looked at the actual scores and saw that one department had individual scores of 3, 3, 3, 3, 3, and 3 while the other department's scores were 5, 5, 5, 1, 1, and 1, we can now see that the groups are quite a bit different. The mean alone did not provide enough information to interpret what was going on in each group.

We have three general measures of variation - range, standard deviation, and variance. Range is simply the difference between the largest and smallest value (largest - smallest = range). Standard deviation and variance are related values. The variance is a somewhat awkward measure to initially understand. To calculate it, we first take the difference between each value and the mean of the entire group. This outcome will have both positive and negative values, and if we add them together we would get a result of 0.
So, to eliminate the negative values, we square each difference. Then we sum these squared values and divide by the count. (Note: this is the same as the mean of the squared differences.) For example, the variance of the data set (2, 3, 4) would be:
Mean = (2 + 3 + 4)/3 = 9/3 = 3
Variance = ((2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2)/3 = ((-1)^2 + 0^2 + 1^2)/3 = (1 + 0 + 1)/3 = 2/3 = 0.667
This gives us an awkward measure - the variance of something measured in inches, for example, would be measured in inches squared - not a measure we use on a daily basis. The standard deviation changes this awkward measure to one that makes more intuitive sense. It does so by taking the positive square root of the variance. This would give us, for our inches measure, a result expressed in inches; the standard deviation is always expressed in the same units as the initial measure. For our example above with the variance of 0.667, the standard deviation would be the square root of 0.667, or 0.817. Both the variance and standard deviation require data that is at least interval in nature. The standard deviation is about 1/6 of the range, and is considered the average difference from the mean for all of the data values in the set (Tanner & Youssef-Morgan, 2013).

Technical point: both the variance and the standard deviation have two different formulas, one for populations and one for samples. The difference is that with the sample formula, the average is found by dividing by (count - 1) rather than the full count. This serves to increase the estimate, since the data in a sample will not be as spread out as in the population (a sample is unlikely to contain both the extreme largest and smallest values). The symbol for the population standard deviation is σ (sigma), while the sample standard deviation symbol is s. In statistics, since we deal with samples, we use the sample formulas. The nice thing about descriptive statistics is that Excel will do all of the math calculations for us; we just need to know how to interpret our results. For a video discussion of descriptive statistics, take a look at Descriptive statistics from the Khan Academy - https://www.khanacademy.org/math/probability/descriptive-statistics.

Research Question Example
Now that we have identified the data types for each of our variables, we need to develop some descriptive statistics - particularly for those at the interval and ratio level. In our discussion and example of salary, we will be using a salary sample of 50 that does not exactly match the data that is available in your data set. It is not significantly different, and should be considered to come from a different sample of the same population. The results will be accurate enough to consider them in answering our equal pay for equal work question for the sample results provided to the class.

Equal Pay Question. The obvious first question to ask is: what is the overall average salary, and what are the averages for the males and females separately? This descriptive statistic should also be accompanied by the standard deviation of each group to examine group diversity. (Reminder: the salary results presented each week will not exactly match the results from this class' data set if you choose to duplicate the results presented in this lecture. The results are statistically close enough to use to answer our assignment question on equal pay.)
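Before turning to the group results, here is a minimal Python sketch (the course itself uses Excel; this version is shown only to make the formulas concrete) that reproduces the location and variation measures above, including the population-versus-sample distinction, for the (2, 3, 4) example.

```python
import statistics

data = [2, 3, 4]

# Location measures
print(statistics.mean(data))       # 3
print(statistics.median(data))     # 3

# Variation measures: population formulas divide by the count n,
# sample formulas divide by (n - 1).
print(statistics.pvariance(data))  # 0.667 -> matches the worked example above
print(statistics.pstdev(data))     # 0.817 -> positive square root of the variance
print(statistics.variance(data))   # 1.0   -> sample variance, divides by n - 1
print(statistics.stdev(data))      # 1.0   -> sample standard deviation

# Range: largest minus smallest
print(max(data) - min(data))       # 2
```

The sample versions (variance, stdev) correspond to Excel's =VAR.S and =STDEV.S; the population versions correspond to =VAR.P and =STDEV.P.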
The related question concerns the standard deviation of each of the three groups (entire sample, males, and females) - what is the standard deviation for each group? In setting up the data for this, copy the salary data column (B1:B51) and paste it on a new sheet. This is a recommended practice - never do analysis on the raw data set, so that the relationships between the various columns are not compromised. Then copy the Gender column (M and F) and paste it beside the salary data. Using Excel's sort function, sort the two columns (at the same time) using Gender as the sort key. This will give you the salary data grouped by males and females. The screen shot below displays the results using both the Descriptive Statistics option found in the Data Analysis list and the =AVERAGE and =STDEV.S functions found in the fx or Formulas - statistics section.

Note a couple of things about the Descriptive Statistics output. First, since for both the overall and female groups the input range included the label Sal, this was shown at the top. The male range did not have a label, so Column 1 was automatically used. We can use Descriptive Statistics for any number of contiguous columns in the input range box. For reporting purposes, we should change the Sal and Column 1 labels to Overall, Female, and Male. The second issue about the Descriptive Statistics output is that it contains much more information than we were looking for. This is a good tool for an overall look at a data set. Looking at the fx values and those from the Descriptive Statistics output, we can see that the means and standard deviations are identical for each group - so it does not matter which approach you use.

Now, looking at the actual statistical values, we see that the overall salary mean (45) lies between the lower female mean (38) and the upper male mean (52) - overall means will always be flanked by the sub-group means, although the differences will not always be equidistant. The standard deviations, on the other hand, are much closer together, with the overall (19.2) being somewhat larger than either the female (18.3) or male (17.8) values. This is also fairly common - the variation in the entire group is generally a bit larger than for the sub-groups. While we did not specifically ask for it, we can also note that the range in each group is very close: 22 - 77 for overall and females, and 24 - 77 for males.

So, what can we say at this point? It appears that males and females have about the same range and standard deviations for salaries, but that females appear to average less than the males. However, at this point, we cannot say anything about our equal pay for equal work question, as the salaries have not been divided into equal work groups. So, at this point we have some interesting information, but no conclusive results yet.
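For those who prefer to see the grouping logic outside of Excel, here is a sketch of the same sort-and-summarize step in Python. The (salary, gender) pairs below are placeholders invented for illustration; the real values live in the course Excel file.

```python
import statistics

# Hypothetical pairs standing in for the Salary and Gender1 columns.
employees = [(34.6, "F"), (76.9, "M"), (41.2, "F"), (52.8, "M"),
             (22.0, "F"), (63.1, "M"), (48.5, "F"), (57.3, "M")]

# Group the salaries by the nominal gender label, mirroring the Excel sort.
groups = {"F": [], "M": []}
for salary, gender in employees:
    groups[gender].append(salary)

overall = [salary for salary, _ in employees]

# Mean and sample standard deviation per group, like =AVERAGE and =STDEV.S.
for name, values in [("Overall", overall),
                     ("Female", groups["F"]),
                     ("Male", groups["M"])]:
    print(f"{name}: mean={statistics.mean(values):.1f}, "
          f"stdev={statistics.stdev(values):.1f}, "
          f"range={min(values)}-{max(values)}")
```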
References
Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.

Week 1 Lecture 3
A second way of looking at data differences or similarities is to consider how likely a given outcome is. In looking at our data set, we could ask questions such as: what is the probability (likelihood) of a male or female salary exceeding 60K? What is the probability that a person's salary is within the range of 38K to 52K? Probability questions about a data set begin to help us look at distributions, a topic we will delve into in more detail in the upcoming weeks. Probability is the likelihood that a specific outcome will occur; it is never negative, and it ranges from 0 (the outcome will never occur) to 1.00 (the outcome will always occur). Generally speaking, we have three kinds of probability: empirical (counting actual outcomes), theoretical (using theory or logic to determine what should occur), and subjective (our individual guesses and feelings). Obviously the theoretical and empirical are the best approaches for business research questions, but at times the best we can get is an expert's guess.

Theoretical probability is just as it sounds - the theory of what the probability should be. For example, if we flip a fair coin, our theory says we should get heads 50% of the time - one outcome out of the two possible. If we flip the coin a number of times, we get the empirical probability - the number of actual heads divided by the number of flips. While this is generally close to .5, matching it exactly usually requires a lot of flips rather than just a few (even up to 100) (Lind, Marchal, & Wathen, 2008). While many approaches to theoretical probability exist (binomial, hypergeometric, Poisson, etc.) (Lind, Marchal, & Wathen, 2008), we will look at just two particular types - the binomial and the normal curve based probabilities. The binomial requires that we have only two outcomes, such as heads and tails when flipping a coin. This is not as restrictive as it might seem, as we can always create two groups out of what we have. For example, if we have a single die (one of a pair of dice), we could form several two-group situations - evens versus odds, 1-3 versus 4-6, etc. We will use the binomial to discuss several basic probability rules.

Four general probability (P) concerns exist. Typically, we want to know one or more of the following:
- the probability of something happening - called P(event);
- the probability of two things happening together - called joint probability: P(A and B);
- the probability of either one or the other but not both events occurring - P(A or B);
- the probability of something occurring given that something else has occurred - conditional probability: P(A|B), read as "the probability of A given B".
Related to these is the complement rule: P(not A) = 1 - P(A) (Lind, Marchal, & Wathen, 2008). Two other ideas are needed. Mutually exclusive means that the elements of one data set do not belong to another - for example, males and pregnant are mutually exclusive data sets. The other term we frequently hear with probability is collectively exhaustive - this simply means that all members of the data set are listed (Lind, Marchal, & Wathen, 2008). Some rules, which apply for both theoretical and empirical probabilities, for dealing with these different probability situations include:
- P(event) = (number of successes)/(number of attempts or possible outcomes)
- P(A and B) = P(A)*P(B) for independent events, or P(A)*P(B|A) for dependent events (the latter uses conditional probability - the probability of B occurring given that A has occurred)
- P(A or B) = P(A) + P(B) - P(A and B); if A and B cannot occur together (such as the example of male and pregnant), then P(A and B) = 0
- P(A|B) = P(A and B)/P(B) (Lind, Marchal, & Wathen, 2008)
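Here is a minimal sketch that checks these rules by direct enumeration of a fair six-sided die - the lecture's own example. The two events chosen (evens; 4 or higher) are invented for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}      # collectively exhaustive outcomes of one die
A = {2, 4, 6}                      # event A: roll an even number
B = {4, 5, 6}                      # event B: roll a 4 or higher

def p(event):
    # P(event) = successes / possible outcomes
    return Fraction(len(event), len(outcomes))

print(p(A))                        # 1/2
print(1 - p(A))                    # complement rule: P(not A) = 1/2
print(p(A & B))                    # joint: P(A and B) = 2/6 = 1/3
print(p(A) + p(B) - p(A & B))      # P(A or B) = 1/2 + 1/2 - 1/3 = 2/3
print(p(A | B))                    # direct count of "A or B" agrees: 2/3
print(p(A & B) / p(B))             # conditional: P(A|B) = (1/3)/(1/2) = 2/3
```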
Binomial Probability
Binomial probabilities deal with dichotomous outcomes - those that have only two possible outcomes. A typical example is flipping a coin; the result can only be a head or a tail. Another common example is gender: we are born as either male or female. The interesting element about binomial outcomes is that while every single trial (such as the flip of a coin) has the same probability, the outcome of a group of trials will not necessarily match that probability. For example, the probability of getting exactly 5 heads out of 10 flips of a fair coin is not .5, but rather 24.6%! This is due to the number of ways the 10 outcomes can be distributed (Lind, Marchal, & Wathen, 2008). We can turn almost any outcome into a dichotomous outcome by creating groups. For example, we can say that when we toss a six-sided die (half of a pair of dice), we have two outcomes: getting a 1 or 2 versus getting anything else. Now we have two outcomes of interest instead of the original six possible outcomes. Tables exist to determine the likelihood, but the easier way is to use the Excel functions found in the fx or Formulas lists. For example, Excel's BINOM.DIST function can quickly provide us with the correct probability of getting a certain number of outcomes within a given number of attempts.

Research Question Example
Understanding the distribution of the data is an important element of understanding what the data are trying to tell us. Probabilities can give us a sense of the data set and allow us to compare results across groups.

Equal Pay Example. In thinking about equal pay, we might be interested in the probability that both males and females appear to be grouped in similar ways as the overall group. This would be an example of an empirical probability, as we would be counting how many of each group fall into each of the ranges we set up. We noted that the overall salary mean was 45, with the female mean equaling 38 and the male mean equaling 52. This suggests one range to look at: what is the probability of someone having a salary between 38 and 52 in each group - overall, females, and males? Translating this into "probability" terms, we want to know:
- What is P(38 <= salary <= 52)? That is, what is the probability that a salary is between 38 and 52, inclusive?
- What is P(38 <= salary <= 52 | Female)? That is, what is the probability that a salary is between 38 and 52, inclusive, given it is a female's salary? Or, for a female, what is the probability that her salary is between 38 and 52, inclusive?
- What is P(38 <= salary <= 52 | Male)? That is, what is the probability that a salary is between 38 and 52, inclusive, given it is a male's salary? Or, for a male, what is the probability that his salary is between 38 and 52, inclusive?
We know a couple of things right off. First, the entire sample has 50 members, and we have 25 males and 25 females. These become the denominators in the respective probabilities. Since you do not have the exact data set we are working with, the counts for salaries in these ranges are: Overall: 8, Females: 3, and Males: 5. So, we have:
P(38 <= salary <= 52) = 8/50 = .16
P(38 <= salary <= 52 | Female) = 3/25 = .12
P(38 <= salary <= 52 | Male) = 5/25 = .20
We can see if gender influences being within this range by checking whether the formula for independent events holds. Above we stated that P(A and B) = P(A)*P(B) for independent events. Here, the joint probability P(within salary range and Female) counts females in the range against the whole sample: 3/50 = .06. Since P(within salary range) = .16 and P(Female) = .5, independence would require:
P(within salary range and Female) = P(within salary range) * P(Female)
Replacing these with the associated values, we would have: .06 = .16 * .5 (= .08).
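A short sketch verifying the numbers quoted above - the 24.6% binomial figure for exactly 5 heads in 10 flips, and the empirical salary-range probabilities with the independence check. The counts (8, 3, and 5) are the ones given in the lecture.

```python
from math import comb

# Binomial check: P(exactly 5 heads in 10 fair flips) = C(10, 5) * (1/2)^10,
# the same quantity Excel's BINOM.DIST returns for this case.
print(comb(10, 5) / 2**10)                 # 0.24609... -> about 24.6%, not .5

# Empirical equal-pay probabilities from the lecture's counts.
n_total, n_female, n_male = 50, 25, 25
in_range_total, in_range_f, in_range_m = 8, 3, 5

p_range = in_range_total / n_total         # P(38 <= salary <= 52)  = .16
p_range_given_f = in_range_f / n_female    # P(range | Female)      = .12
p_range_given_m = in_range_m / n_male      # P(range | Male)        = .20
print(p_range, p_range_given_f, p_range_given_m)

# Independence check: P(range and Female) should equal P(range) * P(Female).
p_range_and_f = in_range_f / n_total       # joint probability = 3/50 = .06
print(p_range_and_f, p_range * (n_female / n_total))  # .06 vs .08 -> not equal
```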
This expression is clearly not true. Since the two sides of the equation are not the same, we can say that gender and being within this salary range are not independent. (Equivalently, for independent events the conditional probability P(within range | Female) = .12 would have to equal the overall P(within range) = .16, and it does not.) Doing this for other ranges produces similar results, so we have a clue that gender and salary interact in ways that suggest males and females are not paid equally. What we still do not know yet is how to consider equal work in our examination.

The Normal Curve
The normal curve is a data distribution that is often called the bell curve: when you plot the likelihood of outcomes occurring, the resulting graph looks like a bell - the most outcomes in the middle (where the mean = median = mode), smoothly decreasing on each side. As a probability distribution, the normal has some interesting characteristics. First, the probability of any outcome equals the area under the curve for that range of outcomes. (Tables and Excel give us these values.) Second, the curve technically extends from minus infinity to plus infinity (although this full range is rarely actually used). Third, since the normal curve describes continuous data, the probability of any single outcome (for example, getting exactly a 76 on a test) is 0. To overcome this, we use a range of values - the 76 score outcome would be the area from 75.5 to 76.5; adding +/- half a unit to a value allows us to translate discrete data into a continuous range (Lind, Marchal, & Wathen, 2008).

The normal curve is important due to its widespread appearance in everyday situations. Some examples of data that follow the normal curve are height, weight, IQ, standardized test scores such as the college boards, many manufacturing measures (above and below the average result), etc. To make working with different normal curves (having different means and standard deviations) easier, we can convert them all into the standard normal curve, which has a mean of 0 and a standard deviation of 1.0. We do this using a z-score: subtract the mean from the data value, and divide the result by the standard deviation. Doing this for every value in a data set would change the mean of the new distribution to 0 (due to the subtraction), while the division changes the standard deviation to 1. The resulting data values are now z-scores, and the area between z-scores is the probability of an outcome within that range of values. One characteristic of the z-score is that it tells us, in standard deviations, how close to or far from the mean any individual score is; so in some ways this is another location measure, but one that focuses on individual values.

Here is an example using the normal distribution and the related Excel functions found in the fx list. (See the Excel Week 1 lecture for guidance on using this function if you are unclear about it.) To find the probability of an outcome between a z-score of 1.63 and a z-score of 2.0, we need the area between these two scores. To do this, we subtract the area under the curve up to the z-score of 1.63 from the area under the curve up to the z-score of 2.0. In Excel, we use the fx function NORM.S.DIST(z, cumulative) this way: =NORM.S.DIST(2.0,1)-NORM.S.DIST(1.63,1) = 0.0288. This tells us that the probability of finding a sample value within this range is about 2.9%.
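The same standard normal areas can be checked outside of Excel. Here is a minimal sketch using Python's standard library NormalDist (a convenience substitute - the course itself uses NORM.S.DIST), covering the three cases discussed below: between two values, above a value, and below a value.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
phi = NormalDist().cdf         # cumulative area up to a z-score, like NORM.S.DIST(z, 1)

print(phi(2.0) - phi(1.63))    # between 1.63 and 2.0  -> 0.0288, about 2.9%
print(phi(2.0) - phi(-1.63))   # between -1.63 and 2.0 -> 0.9257, about 92.6%
print(1 - phi(2.0))            # above 2.0             -> 0.02275, about 2.3%
print(phi(-2.0))               # below -2.0            -> 0.02275, about 2.3%
```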
A second example, with values that are above and below the mean, is done the same way. To find the probability of a sample value between the z-scores of -1.63 and 2.0: =NORM.S.DIST(2.0,1)-NORM.S.DIST(-1.63,1) = 0.9257, or 92.6%. The final example of finding a normal curve based probability is determining the probability of being greater than some value; for example, what is the probability of exceeding a z-score of 2.0? =1-NORM.S.DIST(2.0,1) = 0.02275, or 2.3%. The area below a negative z-score is found by simply using the NORM.S.DIST function directly: =NORM.S.DIST(-2.0,1) = 0.02275, or 2.3%.

A hint on doing these kinds of problems: draw a picture of the normal curve, and draw a vertical line at each of the z-score values you are working with. Then shade the area you are interested in. There are three cases - the area below a certain value, the area above a certain value, and the area between two values. This visual guide helps in determining what we subtract from what. Side note: the probability of exceeding a particular outcome by pure chance alone is called the p-value. We will start using this idea next week.

Research Question Example
We will be assuming that the variables we are using for our equal pay question come from a normally distributed population. This allows us to use normal curve based probabilities and statistical tests to examine the data we are using to answer our question.

Equal Pay Example. Earlier we found that the likelihood of males and females having a salary between 38 and 52K was not the same, suggesting that gender and salary interact in some way. Let us ask if the probability of having a salary greater than the overall mean of 45 is the same for both genders. Since we are assuming that salary is normally distributed as a whole and for each gender, we can use a normal curve probability to examine this. The first step is to find the z-score for the data value of 45 within each gender. We found earlier that the female mean is 38 with a sample standard deviation of 18.3, and the male mean and standard deviation are 52 and 17.8, respectively. This gives us the information needed to determine our z-scores:
Female z = (45 - 38)/18.3 = 7/18.3 = 0.38
Male z = (45 - 52)/17.8 = -7/17.8 = -0.39
The second step is to find the probability of exceeding each of these values using the NORM.S.DIST function:
Female: =1-NORM.S.DIST(0.38,1) = 0.352
Male: =1-NORM.S.DIST(-0.39,1) = 0.652
So, it again appears that males and females have different salary distributions, as males are almost twice as likely as females to be above the overall average of 45. Again, we have not yet considered the equal work element.

References
Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.