2nd+draft-ME-413

University of Agder

Master in Development Management

ME-413 Research Methods in Development Studies

Task 2.3, Activity B Deadline: 24/02/12 Pages: 17

** Conducting Quantitative Data Analysis **

** Gbowee group: **

Susan Ajambo (**weaver**) Hege Lovise Lundgreen Linn Vestbø Veronica Donoso Orgaz

// Copying text written by other people or using the work of other people without cross-referencing, may be considered as cheating //
 * I/we confirm that I/we do not refer to others or in any other way use the work of others without stating it hence I/we confirm that all references are given in the bibliography. || Yes || No ||

=Table of Contents=

 * List of Tables
 * List of figures
 * Question 1: variables
 * Question 2: Univariate Analysis
 * Question 3- Bivariate Analysis
 * Question 4: Multivariate analysis
 * Question 5: Statistical significance
 * Question 6: More on mean and median
 * Question 7: Factors influencing wages
 * Reference

=List of Tables=

 * Table 1: Gender breakdown
 * Table 2: Class test 1- mean and median
 * Table 3: Frequency for class test one- to determine mode
 * Table 4: Class test 2- Range
 * Table 5: The relationship between age and country of birth
 * Table 6: Mean and median wage for public and private sector workers
 * Table 7: Contingency table showing the relationship between sex and wage
 * Table 8: Spearman's rho showing the relationship between sex and wage
 * Table 9: Contingency table showing relationship between birthplace and wage group
 * Table 10: Spearman’s rho showing the relationship between birthplace (rural/urban) and wage

=List of figures=

 * Figure 1: Country of birth breakdown

=Question 1: variables=

1.1 Types of variables
Variables are classified into four main types: interval/ratio, ordinal, nominal and dichotomous variables (Bryman, 2008, p. 321).
 * **Interval/ratio variables:** These are variables where the distances between the categories are identical across the range of categories.
 * **Ordinal variables:** These are variables whose categories can be rank ordered but where the distances between the categories are not equal across the range.
 * **Nominal variables:** These variables are also known as categorical variables, and their categories cannot be rank ordered.
 * **Dichotomous variables:** These variables contain data that have only two categories.

The six variables presented in Sheet 1 of the “Statistics Notebook” are therefore classified as follows:
 * Variable 1 = Nominal
 * Variable 2 = Dichotomous
 * Variable 3 = Nominal
 * Variable 4 = Interval/ratio.
 * Variable 5 = Interval/ratio.
 * Variable 6 = Interval/ratio.

1.2 Why is it important to distinguish between the four different types of variables?
Distinguishing between the four types of variables is important because it helps the researcher to understand his/her data and to determine the appropriate methods of analysis. Conducting quantitative research generates different types of data. The data can take the form of real numbers or lists of categories, and in some cases the categories can be rank ordered while in others they cannot (Bryman, 2008, p. 321). This diversity justifies classifying the data into types of variables, which makes the data easier to handle.

In addition, some methods of data analysis can be used with some types of variables and not with others. According to Bryman (2008, pp. 325-327), methods such as the arithmetic mean and Pearson’s r should only be used in relation to interval/ratio variables. This implies that data analysis involves matching the techniques to the types of variables created and used. Therefore, distinguishing between variables is important in determining the appropriate methods to use for analysis.

=Question 2: Univariate Analysis=

Univariate analysis refers to the analysis of one variable at a time, and it utilizes techniques such as the frequency table and diagrams like the bar chart, pie chart or histogram. A frequency table provides the number of respondents and the percentage for each category of the variable, and it can be used in relation to all kinds of variables. Diagrams can also be used to display data and are relatively easy to interpret and understand. However, the type of variable determines which diagram to use: bar charts or pie charts are appropriate for nominal or ordinal variables, whereas histograms are used when dealing with interval/ratio variables (Bryman, 2008, pp. 322-324).

2.1 Frequency table showing the gender breakdown
Table 1: Gender breakdown ||
 * **Statistics** ||
 * Gender ||
 * N || Valid || 10 ||
 * ^  || Missing || 0

There are equal numbers of males and females in the distribution.
 * **Gender** ||
 * || Frequency || Percent || Valid Percent || Cumulative Percent ||
 * Valid || Female || 5 || 50.0 || 50.0 || 50.0 ||
 * ^  || Male || 5 || 50.0 || 50.0 || 100.0 ||
 * ^  || Total || 10 || 100.0 || 100.0 ||   ||
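
A frequency table like the one above can be produced with a few lines of code. The sketch below is only an illustration of the idea, not the SPSS procedure used for Table 1; the function name `frequency_table` is our own, and the list of values is rebuilt from the counts reported in the table.

```python
from collections import Counter

# Gender values as reported in Table 1: five females and five males.
genders = ["Female"] * 5 + ["Male"] * 5


def frequency_table(values):
    """Return rows of (category, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        percent = 100.0 * counts[category] / total
        cumulative += percent
        rows.append((category, counts[category], percent, cumulative))
    return rows


table = frequency_table(genders)
for category, count, percent, cumulative in table:
    print(f"{category:<8}{count:>4}{percent:>8.1f}{cumulative:>8.1f}")
```

The same function can be applied to any nominal, ordinal or dichotomous variable, since it only counts how often each category occurs.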

2.2 A bar chart showing the country of birth break down
Figure 1: Country of birth breakdown

The largest group of students is from Uganda (3 persons). Ghana and South Africa have two students each, while Chile, Norway and USA have one student each.

2.3 Calculate the //mean//, //median//, and //mode// of the first class test results (measure of central tendency)
A measure of central tendency summarizes a distribution of values in one figure: it presents an average for the distribution. Quantitative data analysis uses three different forms of average, explained below:


 * **//Mean//** is the average: all values in a distribution are summed and divided by the number of values.
 * **//Median//** is the mid-point of all the values. While the mean is vulnerable to outliers (extreme values at either end), the median is not. When there is an even number of values, the median is the mean of the two middle values.

Mean and median for the first class test results
Table 2: Class test 1- mean and median

The mean and the median are both 14.

Mode for the first class test results
Table 3: Frequency for class test one- to determine mode
 * **ClassTest1** ||
 * || **Frequency** || **Percent** || **Valid Percent** || **Cumulative Percent** ||
 * Valid || 4 || 1 || 10.0 || 10.0 || 10.0 ||
 * ^  || 11 || 1 || 10.0 || 10.0 || 20.0 ||
 * ^  || 12 || 1 || 10.0 || 10.0 || 30.0 ||
 * ^  || 13 || 1 || 10.0 || 10.0 || 40.0 ||
 * ^  || 14 || 2 || 20.0 || 20.0 || 60.0 ||
 * ^  || 15 || 1 || 10.0 || 10.0 || 70.0 ||
 * ^  || 17 || 1 || 10.0 || 10.0 || 80.0 ||
 * ^  || 20 || 2 || 20.0 || 20.0 || 100.0 ||
 * ^  || Total || 10 || 100.0 || 100.0 ||   ||

//Mode// is the value that occurs most often in a distribution (Bryman, 2008, p. 325). In the frequency table above, the values 14 and 20 each appear twice, while the others appear only once. The distribution therefore has two modes, 14 and 20.
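
The three averages can be checked with a short calculation. This is a minimal sketch using Python's standard `statistics` module rather than SPSS; the list of scores is rebuilt from the frequencies in Table 3, and the variable names are our own.

```python
import statistics

# Class test 1 scores, reconstructed from the frequency table (Table 3).
class_test_1 = [4, 11, 12, 13, 14, 14, 15, 17, 20, 20]

mean_score = statistics.mean(class_test_1)      # sum of values / number of values
median_score = statistics.median(class_test_1)  # mean of the two middle values
modes = statistics.multimode(class_test_1)      # every value with the highest count

print(mean_score, median_score, modes)
```

`multimode` is used instead of `mode` because the distribution has more than one most frequent value.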

2.4 Calculate the range of the second class test results (measure of dispersion)
The measure of dispersion refers to the amount of variation in a sample. This can be measured using the range or the standard deviation.

The **range** is the difference between the maximum and the minimum value in a distribution of values associated with an interval/ratio variable (Bryman, 2008, p. 325).
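
Both measures of dispersion can be sketched as follows. The scores here are hypothetical: only the minimum (6) and the maximum (20) are taken from the class test 2 results, and the values in between are invented for illustration.

```python
import statistics

# Hypothetical class test 2 scores; only the minimum (6) and maximum (20)
# are taken from the text, the other values are invented for illustration.
class_test_2 = [6, 9, 11, 12, 14, 15, 16, 17, 18, 20]

value_range = max(class_test_2) - min(class_test_2)  # maximum minus minimum
std_dev = statistics.stdev(class_test_2)             # sample standard deviation

print(value_range, round(std_dev, 2))
```

The range depends only on the two extreme values, whereas the standard deviation takes every value in the distribution into account.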

Table 4: Class test 2- Range

The range is 14: the maximum value is 20 and the minimum value is 6, so the difference between the two is 14.

=Question 3- Bivariate Analysis=

3.1 Which measure of correlation should be used to determine a possible relationship between class test 1 and class test 2? Explain your answer.
In order to determine a possible relationship between class test 1 and class test 2, the measure of correlation that should be used is Pearson’s r. The reason is that both class test 1 and class test 2 are interval/ratio variables, as the distance between the categories is identical across the range of categories. This is also how they are defined in question 1. The class tests record the number of correct answers that a person gives in a test consisting of 20 questions. For example, 14 correct answers is 3 fewer than 17, and 9 correct answers is 3 more than 6. Pearson’s r, which is a method for examining relationships between interval/ratio variables, indicates both the strength of the relationship between the two class tests (varying between 0 and 1, where 1 is a perfect relationship and 0 is no relationship between the two variables) and the direction of this relationship (which depends on whether the coefficient is positive or negative). If the correlation is below 1, class test 2 is related to at least one other variable in addition to class test 1. Pearson’s r can only be applied when the relationship between the two variables is linear and not curved (Bryman, 2008, pp. 326-329). That is why this measure is suited to interval/ratio variables, such as class test 1 and class test 2.
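
To make the calculation concrete, Pearson’s r can be sketched as below. The function `pearson_r` is our own helper, and the paired scores are hypothetical values for five students, not the course data.

```python
import math


def pearson_r(x, y):
    """Pearson's r for two equally long lists of interval/ratio values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


# Hypothetical paired scores for five students (not the course data).
test_1 = [4, 11, 14, 17, 20]
test_2 = [6, 10, 13, 18, 20]

r = pearson_r(test_1, test_2)
print(round(r, 3))
```

The sign of the result gives the direction of the relationship, and its absolute value gives the strength.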

3.2 Which measure of correlation should be used to determine a possible relationship between age and class test 1? Explain your answer.
Pearson’s r should also be used to determine a possible relationship between age and class test 1, as both of these are interval/ratio variables. As in the answer above, the distance between the categories is identical across the range of categories. This is the case not only for class test 1 but also for age, as the ages are separated by intervals of one year. It should however be mentioned that if people’s ages had been grouped into categories, e.g. 20 and under, 21-30, 31-40 etc., the age variable would have to be defined as an ordinal variable, where the categories can be rank ordered but the distances between the categories are not necessarily equal across the range (Bryman, 2008, p. 321). In that case Spearman’s rho should be used to determine a possible relationship. Pearson’s r cannot be used, because the relationship between the variables will then not necessarily be linear (Bryman, 2008, p. 329). As with Pearson’s r, the Spearman’s rho coefficient is either positive or negative, and its strength varies between 0 and 1.
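
One common way to compute Spearman’s rho is to replace the values by their ranks (tied values share the mean of their ranks) and then apply Pearson’s r to the ranks. The sketch below follows that approach; the helper names are our own, and the pairing of age groups with test scores is hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical pairing: age group coded 1 = 20 and under, 2 = 21-30,
# 3 = 31 and over, matched with invented test scores.
age_group = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
test_score = [4, 11, 12, 13, 14, 14, 15, 17, 20, 20]

rho = spearman_rho(age_group, test_score)
print(round(rho, 3))
```

Because only the rank order is used, the unequal distances between ordinal categories do not affect the result.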

An important point to keep in mind is that these methods can only uncover relationships between variables. They cannot say with confidence which variable causes the other, that is, which one is the independent and which the dependent variable (Bryman, 2008, p. 326).

==3.3[N1] Could you determine the relationship between ‘Country of birth’ and class test 1? Explain your answer ==

Yes, the relationship between country of birth and class test 1 can be determined using various measures of correlation, as explained below.

To determine the relationship between “country of birth” and class test 1, the first step is to determine what kind of variables they are. As “country of birth” is a variable whose categories cannot be rank ordered, it is a nominal variable (as also stated in question 1). It cannot be stated that being born in South Africa is something more or less than being born in Chile; it is just different. Class test 1, on the other hand, is an interval/ratio variable: its values can be rank ordered, and the distance between the categories is identical across the range. According to Bryman (2008, p. 326), a contingency table with chi-square and Cramér’s //V// can thus be used to determine the relationship between “country of birth” and class test 1. If the interval/ratio variable (in this case class test 1) can be identified as the dependent variable, comparing means with eta could also be used (Bryman, 2008, p. 326). Thus, there is a range of alternatives for determining such a relationship.

The contingency table, which is the most flexible method for analyzing relationships, allows two variables to be analyzed simultaneously, so that relationships and patterns of association can be searched for (Bryman, 2008, pp. 326-327). Cramér’s //V//, which is often reported along with a contingency table, can only indicate the strength of a relationship and not its direction, as this statistic can only take on a positive value (Bryman, 2008, p. 330). The chi-square test is also applied to contingency tables, and establishes how confident one can be that there is a relationship between two variables in the population. It should be mentioned that the chi-square value says nothing on its own; it has to be interpreted in relation to the associated level of statistical significance (Bryman, 2008, pp. 334-335).
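
The chi-square value and Cramér’s //V// can be sketched as follows, using the standard textbook formulas. The 2x2 table of counts is hypothetical and the function names are our own; the sketch does not compute the level of statistical significance, which additionally requires the chi-square distribution and the degrees of freedom.

```python
import math


def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def cramers_v(table):
    """Cramér's V: strength of association, between 0 and 1, never negative."""
    n = sum(sum(row) for row in table)
    smaller_dimension = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (smaller_dimension - 1)))


# Hypothetical 2x2 table of counts (invented, not taken from the data set).
counts = [[30, 10],
          [20, 40]]

chi2 = chi_square(counts)
v = cramers_v(counts)
print(round(chi2, 3), round(v, 3))
```

Since every term in the chi-square sum is squared, the resulting //V// cannot indicate a direction, which matches the point made above.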

Comparing means with eta is, as mentioned above, a method that can be very fruitful when measuring a relationship between an interval/ratio variable and a nominal variable, provided that the interval/ratio variable can relatively unambiguously be identified as the dependent variable (Bryman, 2008, p. 330). In the case of “country of birth” and class test 1, it can be argued that class test 1, the interval/ratio variable, is the dependent variable: the number of correct answers in class test 1 can by no means affect which country a person is born in. It can however be interesting to see whether the country a person comes from affects the number of correct answers in class test 1. This method compares the means of the interval/ratio variable for each subgroup of the nominal variable, and is often accompanied by eta, a statistic that expresses the level of association between the two variables. This value will always be positive (Bryman, 2008, p. 330).
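
Comparing means and eta can be illustrated as below. Eta is computed here with the usual formula, the square root of the between-group sum of squares divided by the total sum of squares. The scores per country are hypothetical (the data set does not tell us here which score belongs to which country), and the function name `eta` is our own.

```python
import math


def eta(groups):
    """Eta for a dict mapping each nominal category to a list of scores."""
    all_scores = [score for scores in groups.values() for score in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


# Hypothetical scores per country (the pairing is invented for illustration).
scores_by_country = {
    "Ghana": [11, 13],
    "South Africa": [17, 20],
    "Uganda": [12, 14, 15],
}

group_means = {
    country: sum(scores) / len(scores)
    for country, scores in scores_by_country.items()
}
association = eta(scores_by_country)
print(group_means, round(association, 3))
```

The group means show how the dependent variable differs between the categories, while eta summarizes how much of the variation in scores lies between the groups.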

It should be mentioned that generalizing from such measures of correlation might be problematic in this case, as the sample in this dataset is very small, consisting of only 10 people (n=10). Each country is represented by only one or two persons, at most three. It can be argued that the sample is too small to be representative of the entire population from which it is selected. Even though the correlation results might be of interest, this limitation is worth keeping in mind.

=Question 4: Multivariate analysis=

4.1 What is a spurious relationship?
A spurious relationship is an apparent relationship between two variables that is in fact produced by a third variable. Thus, the relationship between the two variables is not real. In other words, a spurious relationship occurs when a third variable affects both of the other two variables (Bryman, 2008, pp. 330-331). If the third variable is controlled for, the relationship between the two other variables disappears, as it is not a direct one (Bryman, 2008, p. 699).

4.2 Draw a contingency table showing the relationship between age and ‘Country of birth’
Contingency tables make it possible to examine relationships, and can reveal patterns of association (Bryman, 2008, p.327).

Table 5: The relationship between age and country of birth
 * **Age-group * Country of birth Cross tabulation** ||
 * |||||||||||| **Country of birth** || **Total** ||
 * ^  || Chile || Ghana || Norway || South Africa || Uganda || USA ||^   ||
 * **Age-group** || 20 and under || Count || 0 || 2 || 0 || 0 || 0 || 0 || 2 ||
 * ^  ||^   || % within Country of birth || .0% || 100.0% || .0% || .0% || .0% || .0% || 20.0% ||
 * ^  || 21-30 || Count || 1 || 0 || 1 || 0 || 3 || 1 || 6 ||
 * ^  ||^   || % within Country of birth || 100.0% || .0% || 100.0% || .0% || 100.0% || 100.0% || 60.0% ||
 * ^  || 31 and over || Count || 0 || 0 || 0 || 2 || 0 || 0 || 2 ||
 * ^  ||^   || % within Country of birth || .0% || .0% || .0% || 100.0% || .0% || .0% || 20.0% ||
 * Total || Count || 1 || 2 || 1 || 2 || 3 || 1 || 10 ||
 * ^  || % within Country of birth || 100.0% || 100.0% || 100.0% || 100.0% || 100.0% || 100.0% || 100.0% ||
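
The cross tabulation above can be reproduced with a short sketch. The list of students is rebuilt from the counts in Table 5, and the helper `percent_within_country` is our own name for the column percentage that SPSS labels "% within Country of birth".

```python
from collections import Counter

# One (age group, country) pair per student, reconstructed from Table 5.
students = (
    [("20 and under", "Ghana")] * 2
    + [("21-30", "Chile"), ("21-30", "Norway"), ("21-30", "USA")]
    + [("21-30", "Uganda")] * 3
    + [("31 and over", "South Africa")] * 2
)

cell_counts = Counter(students)
country_totals = Counter(country for _, country in students)
age_totals = Counter(age for age, _ in students)


def percent_within_country(age, country):
    """Share of a country's students who fall in the given age group."""
    return 100.0 * cell_counts[(age, country)] / country_totals[country]


for age in ["20 and under", "21-30", "31 and over"]:
    row = [cell_counts[(age, country)] for country in sorted(country_totals)]
    print(f"{age:<14}{row} total={age_totals[age]}")
```

Counting pairs of categories in this way is all a contingency table does; the percentages are then calculated within each column.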

The majority of the students (60%) are in the age category 21-30. The students from South Africa are the oldest (31 and over) while those from Ghana are the youngest (20 and under). All the students from Chile, Norway, Uganda and USA belong to the same age category (21-30).

=Question 5: Statistical significance=

//What does it mean to say that a correlation of 0.78 is statistically significant at p < 0.05?//
According to Bryman (2008, pp. 333, 699), statistical significance is an estimate of how confident a researcher can be that the results from a randomly selected sample are generalizable to the population from which the sample was drawn. The test gives the researcher insight into the risk of concluding that a relationship exists when it does not.

The level of statistical significance is therefore the level of risk that a researcher is prepared to take of inferring that there is a relationship between two variables when no such relationship exists. Levels of statistical significance are expressed as probability levels, and most social researchers agree that the maximum acceptable level of statistical significance is p<0.05 //(p means probability)//. This implies that fewer than 5 samples in 100 would show a relationship when in reality there is none, so the risk is fairly small. Undertaking this test requires a researcher to set up a null hypothesis (a hypothesis stipulating that two variables are not related), which is then tested. If the findings are statistically significant at p<0.05 (the generally acceptable level), the risk of inferring a relationship that does not actually exist is low. The researcher would thus reject the null hypothesis and infer that the relationship does exist.
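
The idea of "chances in 100" can be illustrated with a permutation test. This is a different technique from the significance test reported by SPSS and is used here only as an illustration: the values of one variable are shuffled many times, which breaks any real relationship, and we count how often chance alone produces a correlation at least as strong as the observed one. The data and function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


def permutation_p_value(x, y, shuffles=2000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    at_least_as_strong = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # simulates the null hypothesis of no relationship
        if abs(pearson_r(x, shuffled)) >= observed:
            at_least_as_strong += 1
    return at_least_as_strong / shuffles


# Hypothetical paired observations with a clear positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

p_value = permutation_p_value(x, y)
print(p_value)
```

A result below 0.05 would mean that fewer than 5 in 100 random re-pairings produce a correlation this strong, so the null hypothesis of no relationship would be rejected.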

Based on the explanation above, to say that a correlation of 0.78 is statistically significant at p < 0.05 means that the risk of inferring a relationship between two variables whose correlation is 0.78, when no such relationship exists in the population, is fairly low: fewer than 5 chances in 100. The null hypothesis can thus be rejected, and it can be inferred that there are fewer than 5 chances in 100 that a correlation of 0.78 could have arisen by chance alone.

=Question 6: More on mean and median=

6.1 Based on the data on ‘sheet 2’ (salary figures for 166 workers), calculate the mean wage and the median wage for public sector workers and for private sector workers.
Table 6: Mean and median wage for public and private sector workers
 * **Employee** || **Mean** || **Median** ||
 * Public workers || 573.1461 || 580 ||
 * Private workers || 597.5974 || 530 ||

6.2 In the public sector, the median is higher than the mean, but in the private sector, the mean is higher than the median. Why is this?
According to Bryman (2008, p. 325), the mean is vulnerable to outliers (extreme values at either end of the distribution), which exert considerable upward or downward pressure on the mean, whereas the median is not affected in this way. In the public sector, the median is higher than the mean, which suggests that some considerably lower wages pull the mean down. In the private sector, the median is lower than the mean because a few considerably higher wages, such as 2550 and 3500, inflated the mean.
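
The effect of outliers on the mean can be shown with a small worked example. The wages below are hypothetical; only the two extreme values, 2550 and 3500, are taken from the private sector data mentioned above.

```python
import statistics

# Hypothetical private-sector wages; only 2550 and 3500 come from the text.
typical_wages = [450, 480, 500, 530, 530, 560, 600, 620]
with_outliers = typical_wages + [2550, 3500]

mean_without = statistics.mean(typical_wages)
median_without = statistics.median(typical_wages)
mean_with = statistics.mean(with_outliers)
median_with = statistics.median(with_outliers)

print(mean_without, median_without)
print(mean_with, median_with)
```

In this invented example, adding the two high wages raises the mean by several hundred, while the median moves only slightly.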

6.3 Do you think it is better to use the mean or the median when examining the income of a group?
We argue that it is better to use the mean when examining the income of a group, but this does not mean that the median should be disregarded completely, as explained below:

The median only informs about the income at the middle of the distribution, whereas the mean takes into consideration all values in the distribution and can therefore deviate from the mid-point. This makes the mean a better option than the median. We do however acknowledge that the mean, unlike the median, is vulnerable to outliers. To limit the effect of this on the results, we argue that the median can be employed alongside the mean to cross-check the results. We believe that employing more than one measure for a single variable increases measurement validity.

In addition, wages are an interval/ratio variable, so using the mean to examine them is appropriate (Bryman, 2008, p. 325). Similarly, employing the median to cross-check the mean is fitting, since the median can be employed for both interval/ratio and ordinal variables.

=Question 7: Factors influencing wages=

In attempting questions 7.1 and 7.2, we chose to employ two methods: contingency tables and Spearman’s rho. The contingency tables were used to examine the relationships and reveal patterns of association between the variables, while Spearman’s rho was used to determine the correlation coefficient (strength) and the statistical significance. This decision was informed by Bryman’s (2008, p. 326) observation that while contingency tables are probably the most flexible method of analyzing relationships, they are not always the most efficient method. The methods were used in a mutually supportive manner in the final analysis.

Both questions involved an interval/ratio variable and a dichotomous variable. In question 7.1, the variables were sex (dichotomous) and wage (interval/ratio), while question 7.2 involved wage (interval/ratio) and birthplace, which was categorized into rural and urban and is thus dichotomous.

Before making the calculations, however, the variables sex and birthplace were re-coded as follows: sex into Female=1 and Male=2, and birthplace into Rural=1 and Urban=2. This was aimed at creating numeric variables.

7.1 How strongly does a person's sex affect their wage?
A larger share of females (45.7%) than of males (33.3%) is in the lowest wage group, while a larger share of males (59.7% against 51.1%) is in the 500-999 wage group.

Table 7: Contingency table showing the relationship between sex and wage
 * **Wage group * Sex Cross tabulation** ||
 * |||| **Sex** || **Total** ||
 * ^  || Female || Male ||^   ||
 * Wage group || 499 and under || Count || 43 || 24 || 67 ||
 * ^  ||^   || % within Sex || 45.7% || 33.3% || 40.4% ||
 * ^  || 500-999 || Count || 48 || 43 || 91 ||
 * ^  ||^   || % within Sex || 51.1% || 59.7% || 54.8% ||
 * ^  || 1000-1499 || Count || 1 || 4 || 5 ||
 * ^  ||^   || % within Sex || 1.1% || 5.6% || 3.0% ||
 * ^  || 1500-1999 || Count || 1 || 0 || 1 ||
 * ^  ||^   || % within Sex || 1.1% || .0% || .6% ||
 * ^  || 2500 and over || Count || 1 || 1 || 2 ||
 * ^  ||^   || % within Sex || 1.1% || 1.4% || 1.2% ||
 * Total || Count || 94 || 72 || 166 ||
 * ^  || % within Sex || 100.0% || 100.0% || 100.0% ||

The correlation coefficient between sex and wage is 0.204, which is statistically significant at p<0.01. The correlation, though positive, is rather weak.

Table 8: Spearman's rho showing the relationship between sex and wage
 * Correlations ||
 * || Sex || Wage ||
 * Spearman's rho || Sex || Correlation Coefficient || 1.000 || .204 ||
 * ^  ||^   || Sig. (2-tailed) || . || .008 ||
 * ^  ||^   || N || 166 || 166 ||
 * ^  || Wage || Correlation Coefficient || .204 || 1.000 ||
 * ^  ||^   || Sig. (2-tailed) || .008 || . ||
 * ^  ||^   || N || 166 || 166 ||
 * **. Correlation is significant at the 0.01 level (2-tailed). ||

The correlation is statistically significant at p<0.01 (the table shows a significance value of 0.008), which means that there is less than 1 chance in 100 of inferring a relationship between sex and wage when none exists. With such a high level of statistical significance, one can say with great confidence that the correlation is real and that it can be generalized to the broader population from which the sample was selected. The risk of assuming that there is a relationship between sex and wage when there is in fact none is very small (Bryman, 2008, p. 334).

7.2 How strongly does it matter whether they live in an urban or a rural area?
A larger share of people from rural areas (58.4%) than from urban areas (24.7%) is in the lowest wage group.

Table 9: Contingency table showing the relationship between birthplace and wage group
 * **Wage group * Birthplace Cross tabulation** ||
 * |||| **Birthplace** || **Total** ||
 * ^  || rural || urban ||^   ||
 * Wage group || 499 and under || Count || 45 || 22 || 67 ||
 * ^  ||^   || % within Birthplace || 58.4% || 24.7% || 40.4% ||
 * ^  || 500-999 || Count || 31 || 60 || 91 ||
 * ^  ||^   || % within Birthplace || 40.3% || 67.4% || 54.8% ||
 * ^  || 1000-1499 || Count || 1 || 4 || 5 ||
 * ^  ||^   || % within Birthplace || 1.3% || 4.5% || 3.0% ||
 * ^  || 1500-1999 || Count || 0 || 1 || 1 ||
 * ^  ||^   || % within Birthplace || .0% || 1.1% || .6% ||
 * ^  || 2500 and over || Count || 0 || 2 || 2 ||
 * ^  ||^   || % within Birthplace || .0% || 2.2% || 1.2% ||
 * Total || Count || 77 || 89 || 166 ||
 * ^  || % within Birthplace || 100.0% || 100.0% || 100.0% ||

The correlation coefficient between wage and birthplace is 0.417, which is statistically significant at p<0.01. The correlation is positive and indicates a moderate relationship between the variables.

Table 10: Spearman’s rho showing the relationship between birthplace (rural/urban) and wage
 * **Correlations** ||
 * || **Birthplace** || **Wage** ||
 * Spearman's rho || Birthplace || Correlation Coefficient || 1.000 || .417 ||
 * ^  ||^   || Sig. (2-tailed) || . || .000 ||
 * ^  ||^   || N || 166 || 166 ||
 * ^  || Wage || Correlation Coefficient || .417 || 1.000 ||
 * ^  ||^   || Sig. (2-tailed) || .000 || . ||
 * ^  ||^   || N || 166 || 166 ||
 * **. Correlation is significant at the 0.01 level (2-tailed). ||

The correlation is statistically significant at p<0.01. This means that there is less than 1 chance in 100 of inferring a relationship between birthplace and wage when none exists. With such a high level of statistical significance, one can say with great confidence that the correlation is real and that it can be generalized to the broader population from which the sample of 166 employees was selected. The risk of assuming that there is a relationship between birthplace and wage when there is in fact none is very small (Bryman, 2008, p. 334).

7.3 Do you think the relationship between a person's sex or location and their wages shows correlation, causation, coincidence, or nothing at all?
Based on the measurements and analyses above, the relationships between sex and wage and between location and wage clearly show correlation, as both have positive correlation coefficients. However, the relationship between birthplace and wage (0.417) is stronger than that between sex and wage (0.204). It can thus be concluded that location (rural/urban) is more strongly associated with wage than sex is; nonetheless, both matter to some degree.

Even though patterns of association are evident between the variables, it cannot be said that the relationships are causal. This is because the methods used, Spearman’s rho and contingency tables, only illustrate relationships, not causality (Bryman, 2008, p. 36).

Furthermore, the correlation is rather weak between sex and wage and moderate between wage and birthplace, implying that at least one, and probably many, other factors play a role in explaining wages. However, the relationships cannot be said to be a coincidence, because the level of statistical significance is high for both of them (p<0.01, stricter than the generally accepted level of p<0.05), which minimizes the probability that the relationships arose by chance.

One should also ask whether the relationship between sex and wage or between location and wage could be a spurious relationship, in which a third variable makes it seem as if there is a relationship when there is in fact no direct one. Rather than one of the two variables influencing the other, a third variable might be affecting both. When controlling for such a third variable, much of the apparent correlation between sex and wage or location and wage might disappear. This would be interesting for further research applying multivariate analysis.

=Reference:=

Bryman, A. (2008). //Social Research Methods// (3rd Edition). Oxford: Oxford University Press

[N1] This part still remains as it was in the first draft- I did not see any comments.