Saturday, October 31, 2015

Running an analysis of variance

I am using the Gapminder dataset and my response variable is FemaleEmploymentRate and Explanatory variable is Polityscore.

My hypothesis is that female employment rate is related to polity score. Polity score captures the regime authority spectrum on a 21-point scale ranging from -10 (hereditary monarchy) to +10 (consolidated democracy). Polity score is the category variable with 21 possible categories.

I have chosen to look at just the G20 countries for my research. The data set is managed accordingly.


Code


"""
Created on Fri Oct 30 06:50:52 2015 
@author: Abhishek

"""
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f'%x)

data['femaleemployrate'] = data['femaleemployrate'].convert_objects(convert_numeric=True)
data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)

dataG20Copy = data[(data['country'] == 'Argentina') |
            (data['country'] == 'Australia') |
            (data['country'] == 'Brazil') |
            (data['country'] == 'Canada') |
            (data['country'] == 'China') |
            (data['country'] == 'France') |
            (data['country'] == 'Germany') |
            (data['country'] == 'India') |
            (data['country'] == 'Indonesia') |
            (data['country'] == 'Italy') |
            (data['country'] == 'Japan') |
            (data['country'] == 'Mexico') |
            (data['country'] == 'Russia') |
            (data['country'] == 'Saudi Arabia') |
            (data['country'] == 'South Africa') |
            (data['country'] == 'Korea, Rep.') |
            (data['country'] == 'Turkey') |
            (data['country'] == 'United Kingdom') |
            (data['country'] == 'United States')]


# Not always necessary but can eliminate a setting with copy warning that is displayed
dataG20 = dataG20Copy.copy()

subPolity = dataG20[['femaleemployrate','polityscore']].dropna()

modelPolity = smf.ols(formula='femaleemployrate ~ C(polityscore)',data=subPolity).fit()
print(modelPolity.summary())

mean = subPolity.groupby('polityscore').mean()
print(mean)

sd = subPolity.groupby('polityscore').std()
print(sd)

mc1 = multi.MultiComparison(subPolity['femaleemployrate'],subPolity['polityscore'])
res1 = mc1.tukeyhsd()
print(res1.summary())


OLS Test Results


Group Means


Looking at the p value, we see that there is good chance that the null hypothesis can be rejected. Post Hoc test results will determine for which categories null hypothesis can be rejected.

Turn Key HSD / Post Hoc Test Results


The Groups for which reject column is True in the above results are groups where NULL Hypothesis can be safely rejected. 

In conclusion, it is evident that Female Employment Rate is indeed dependent on Polity score of a G20 country.

No comments:

Post a Comment