Saturday, October 31, 2015

Running an analysis of variance

I am using the Gapminder dataset. My response variable is female employment rate (femaleemployrate) and my explanatory variable is polity score (polityscore).

My hypothesis is that female employment rate is related to polity score. Polity score captures the regime authority spectrum on a 21-point scale ranging from -10 (hereditary monarchy) to +10 (consolidated democracy). Here polity score is treated as a categorical variable with 21 possible categories.

I have chosen to look at just the G20 countries for my research. The data set is managed accordingly.
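As a rough sketch of this kind of subsetting, the country filter can be written with `isin` rather than a long chain of comparisons. The tiny frame, its values, and the shortened membership list below are made up for illustration; only the `country` and `femaleemployrate` column names come from the Gapminder file.

```python
import pandas

# Made-up miniature stand-in for gapminder.csv (values are illustrative only).
sample = pandas.DataFrame({
    'country': ['Brazil', 'Chile', 'India', 'Norway', 'Japan'],
    'femaleemployrate': [50.0, 40.0, 30.0, 60.0, 45.0],
})

# Hypothetical shortened membership list; the real G20 list is longer.
members = ['Brazil', 'India', 'Japan']

# isin() keeps rows whose country is in the list; .copy() returns an
# independent frame so later column assignments do not raise
# SettingWithCopyWarning.
subset = sample[sample['country'].isin(members)].copy()
print(subset)
```

The same pattern should scale to the full country list without repeating the comparison for every country.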


Code


"""
Created on Fri Oct 30 06:50:52 2015 
@author: Abhishek

"""
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f'%x)

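# convert_objects() has been removed from pandas; to_numeric with
# errors='coerce' turns non-numeric entries (such as blanks) into NaN
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')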

dataG20Copy = data[(data['country'] == 'Argentina') |
            (data['country'] == 'Australia') |
            (data['country'] == 'Brazil') |
            (data['country'] == 'Canada') |
            (data['country'] == 'China') |
            (data['country'] == 'France') |
            (data['country'] == 'Germany') |
            (data['country'] == 'India') |
            (data['country'] == 'Indonesia') |
            (data['country'] == 'Italy') |
            (data['country'] == 'Japan') |
            (data['country'] == 'Mexico') |
            (data['country'] == 'Russia') |
            (data['country'] == 'Saudi Arabia') |
            (data['country'] == 'South Africa') |
            (data['country'] == 'Korea, Rep.') |
            (data['country'] == 'Turkey') |
            (data['country'] == 'United Kingdom') |
            (data['country'] == 'United States')]


# Copying is not always necessary, but it avoids the SettingWithCopyWarning that pandas may display
dataG20 = dataG20Copy.copy()

subPolity = dataG20[['femaleemployrate','polityscore']].dropna()

modelPolity = smf.ols(formula='femaleemployrate ~ C(polityscore)',data=subPolity).fit()
print(modelPolity.summary())

mean = subPolity.groupby('polityscore').mean()
print(mean)

sd = subPolity.groupby('polityscore').std()
print(sd)

mc1 = multi.MultiComparison(subPolity['femaleemployrate'],subPolity['polityscore'])
res1 = mc1.tukeyhsd()
print(res1.summary())


OLS Test Results


Group Means


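The p-value in the OLS summary suggests that the null hypothesis (no difference in mean female employment rate across polity score categories) can be rejected. Because polity score has more than two categories, the post hoc test results are needed to determine which pairs of categories differ.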
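To make concrete what the F-statistic in the OLS summary measures, here is a minimal hand computation of a one-way ANOVA F value. The three groups and their numbers are invented for illustration and are not Gapminder values.

```python
# One-way ANOVA F statistic computed by hand on invented data.
groups = {
    'low': [30.0, 32.0, 34.0],
    'mid': [40.0, 42.0, 44.0],
    'high': [50.0, 52.0, 54.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: spread of the observations around their own group mean.
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the between-group to the within-group mean square.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F means the group means differ by much more than the scatter inside each group would explain, which is what leads to a small p-value.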

Tukey HSD / Post Hoc Test Results


The pairs of groups for which the reject column is True in the above results are the pairs for which the null hypothesis of equal means can be rejected.
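As a hedged sketch of how the reject column can be read programmatically, the example below runs Tukey's HSD on invented data with three groups labelled A, B and C (none of this is Gapminder data) and lists the pairs flagged as significantly different. It assumes statsmodels' `pairwise_tukeyhsd` helper, which wraps the same `MultiComparison(...).tukeyhsd()` call used in the code above.

```python
import numpy
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented data: groups A and B are similar, group C is clearly higher.
values = numpy.array([10, 11, 12, 10, 11,
                      10.5, 11.5, 12, 10, 11,
                      30, 31, 32, 30, 31], dtype=float)
labels = numpy.array(['A'] * 5 + ['B'] * 5 + ['C'] * 5)

result = pairwise_tukeyhsd(values, labels, alpha=0.05)

# Groups are compared in sorted order, pair by pair: (A, B), (A, C), (B, C).
group_names = sorted(set(labels.tolist()))
pairs = [(group_names[i], group_names[j])
         for i in range(len(group_names))
         for j in range(i + 1, len(group_names))]

# result.reject holds one boolean per pair: True means the null hypothesis
# of equal means is rejected for that pair.
significant = [pair for pair, flag in zip(pairs, result.reject) if flag]
print(significant)
```

With many categories and few countries per category, as in the G20 subset, most pairs may not reach significance, so reading the reject column pair by pair matters.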

In conclusion, female employment rate appears to be associated with the polity score of a G20 country.

Sunday, October 11, 2015

Data Visualization

The code for the visualization is provided below.

 # -*- coding: utf-8 -*-  
 """  
 Created on Sun Oct 4 20:25:31 2015  
 @author: Abhishek  
 """  
 import pandas  
 import numpy  
 import seaborn  
 import matplotlib.pyplot as plt  
 data = pandas.read_csv('gapminder.csv', low_memory=False)  
 pandas.set_option('display.float_format', lambda x: '%f'%x)  
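 # convert_objects() has been removed from pandas; to_numeric with
 # errors='coerce' turns non-numeric entries (such as blanks) into NaN
 data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
 data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
 data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')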
 dataG20Copy = data[(data['country'] == 'Argentina') |  
       (data['country'] == 'Australia') |  
       (data['country'] == 'Brazil') |  
       (data['country'] == 'Canada') |  
       (data['country'] == 'China') |  
       (data['country'] == 'France') |  
       (data['country'] == 'Germany') |  
       (data['country'] == 'India') |  
       (data['country'] == 'Indonesia') |  
       (data['country'] == 'Italy') |  
       (data['country'] == 'Japan') |  
       (data['country'] == 'Mexico') |  
       (data['country'] == 'Russia') |  
       (data['country'] == 'Saudi Arabia') |  
       (data['country'] == 'South Africa') |  
       (data['country'] == 'Korea, Rep.') |  
       (data['country'] == 'Turkey') |  
       (data['country'] == 'United Kingdom') |  
       (data['country'] == 'United States')]  
 # Copying is not always necessary, but it avoids the SettingWithCopyWarning that pandas may display
 dataG20 = dataG20Copy.copy()  
 # Female Employment Rate
 print('Describe Female Employment Rate of G20 countries')
 desc1 = dataG20['femaleemployrate'].describe()
 print(desc1)
 plt.figure()  # start a new figure so the plots do not overlap
 # distplot is deprecated in recent seaborn; histplot draws the same histogram
 seaborn.histplot(dataG20['femaleemployrate'].dropna())
 plt.xlabel('Female Employment Rate')
 plt.title('Female Employment Rate in G20 Countries')
 # Income Per Person
 print('Describe Income Per Person of G20 countries')
 desc2 = dataG20['incomeperperson'].describe()
 print(desc2)
 plt.figure()
 seaborn.histplot(dataG20['incomeperperson'].dropna())
 plt.xlabel('Income Per Person')
 plt.title('Income Per Person in G20 Countries')
 # Polity Score is categorical, so use a count plot rather than a histogram
 dataG20['polityscorecat'] = dataG20['polityscore'].astype('category')
 plt.figure()
 seaborn.countplot(x='polityscorecat', data=dataG20)
 plt.xlabel('Polity Score')
 plt.title('Polity Score of G20 Countries')
 # Scatter plot of female employment rate against income per person
 plt.figure()
 scat1 = seaborn.regplot(x="incomeperperson", y="femaleemployrate", fit_reg=False, data=dataG20)
 plt.xlabel('Income Per Person')
 plt.ylabel('Female Employment Rate')
 plt.show()

The univariate graph results







My research question was whether income per person is related to female employment rate. It seems that although female employment rate has a positive relationship with income per person in G20 countries, the relationship is weak.
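One way to put a number on "positive but weak" is Pearson's correlation coefficient, which this post does not compute. The sketch below works it out by hand on invented numbers (not the Gapminder values) purely to show the calculation.

```python
import math

# Invented values for five hypothetical countries.
income = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
employ = [45.0, 40.0, 50.0, 42.0, 48.0]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(employ) / n

# Sums of cross-products and squared deviations from the means.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, employ))
var_x = sum((x - mean_x) ** 2 for x in income)
var_y = sum((y - mean_y) ** 2 for y in employ)

# Pearson's r lies between -1 and 1; values near 0 indicate a weak linear relationship.
r = cov / math.sqrt(var_x * var_y)
print(round(r, 3))
```

On this made-up data r comes out positive but well below 1, the kind of value that would match a scatter plot with a faint upward trend.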

Sunday, October 4, 2015

Data Management

It turns out that I have already done some data management in my previous assignment. While calculating the frequency distribution of female employment rate, I grouped the employment rates into 4 groups, and I applied a similar grouping technique to GDP. I feel the Gapminder data set is rather straightforward and does not require much management. Instead of arbitrary groups, I will break down female employment rate and GDP (income per person) into quartiles for this assignment.
 # -*- coding: utf-8 -*-  
 """  
 Created on Sun Oct 4 20:25:31 2015  
 @author: Abhishek  
 """  
 import pandas  
 import numpy  
 data = pandas.read_csv('gapminder.csv', low_memory=False)  
 pandas.set_option('display.float_format', lambda x: '%f'%x)  
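 # convert_objects() has been removed from pandas; to_numeric with
 # errors='coerce' turns non-numeric entries (such as blanks) into NaN
 data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
 data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
 data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')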
 dataG20Copy = data[(data['country'] == 'Argentina') |  
       (data['country'] == 'Australia') |  
       (data['country'] == 'Brazil') |  
       (data['country'] == 'Canada') |  
       (data['country'] == 'China') |  
       (data['country'] == 'France') |  
       (data['country'] == 'Germany') |  
       (data['country'] == 'India') |  
       (data['country'] == 'Indonesia') |  
       (data['country'] == 'Italy') |  
       (data['country'] == 'Japan') |  
       (data['country'] == 'Mexico') |  
       (data['country'] == 'Russia') |  
       (data['country'] == 'Saudi Arabia') |  
       (data['country'] == 'South Africa') |  
       (data['country'] == 'Korea, Rep.') |  
       (data['country'] == 'Turkey') |  
       (data['country'] == 'United Kingdom') |  
       (data['country'] == 'United States')]  
 # Copying is not always necessary, but it avoids the SettingWithCopyWarning that pandas may display
 dataG20 = dataG20Copy.copy()  
 print('FEMALE EMP RATE: 4 Quartiles')  
 # Use the G20 subset here as well, consistent with the income quartiles below
 dataG20['femaleemployrate4'] = pandas.qcut(dataG20.femaleemployrate,4,labels=["1=25%tile","2=50%tile","3=75%tile","4=100%tile"])
 qF = dataG20['femaleemployrate4'].value_counts(sort=True,dropna=True,normalize=True) * 100
 print(qF)
 #print(pandas.crosstab(dataG20['femaleemployrate'],dataG20['femaleemployrate4']))
 print('INCOME PER PERSON: 4 Quartiles')  
 dataG20['incomeperperson4'] = pandas.qcut(dataG20.incomeperperson,4,labels=["1=25%tile","2=50%tile","3=75%tile","4=100%tile"])  
 gdpQ = dataG20['incomeperperson4'].value_counts(sort=False,dropna=True, normalize=True) * 100  
 print(gdpQ)
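
To see what `pandas.qcut` does with the quartile labels used above, here is a small sketch on eight invented values (not Gapminder data). With evenly spread values, each quartile should receive the same share of observations.

```python
import pandas

# Eight invented values, already sorted for readability.
rates = pandas.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = ["1=25%tile", "2=50%tile", "3=75%tile", "4=100%tile"]

# qcut picks the bin edges from the data's own quantiles, so each
# bin holds roughly the same number of observations.
quartile = pandas.qcut(rates, 4, labels=labels)

# Percentage of observations falling in each quartile.
share = quartile.value_counts(sort=False, normalize=True) * 100
print(share)
```

This differs from `pandas.cut`, which splits the value range into equal-width bins and can leave some bins nearly empty when the data are skewed.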