Questions Graphs and Numbers: November 2015

Sunday, November 29, 2015

About my Data

Sample

The data set I have used so far in my previous courses (Data Management and Visualization and Data Analysis Tools) is Gapminder. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas. Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics Division, and the World Bank.

Procedures

The variables that I have used from this data set so far are female employment rate, polity score, income per person, urban population rate, and internet usage rate. For several assignments I have selected data for the above indicators for just the G20 countries.

Female Employment Rate

Percentage of female population, age above 15, that has been employed during the given year.
Source: International Labor Organization
Observational
Complete reference link

Polity Score:

Democracy score (based on Polity IV)
Source: Polity IV Project: Political Regime Characteristics and Transitions, 1800-2009. Overall polity score from the Polity IV dataset
Observational
Complete reference Link

Income per Person:

Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). The inflation and differences in the cost of living between countries has been taken into account
Sources: Cross-country data for 2011 is mainly based on the 2011 round of the International Comparison Program. Estimates based on other sources were used for the other countries. Real growth rates were linked to the 2011 levels. Several sources are used for these growth rates, such as the data of Angus Maddison. Follow the link below to download the detailed documentation.
Experimental
Complete reference link

Urban Population Rate:

Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects. Source: World Bank Staff estimates based on United Nations, World Urbanization Prospects.
Source: World Bank
Observational
Complete reference link

Internet Use Rate

Internet users are people with access to the worldwide network (per 100 people)
Source: World Bank
Complete reference link

Measures

Female Employment Rate: 2007 female employees age 15+ (% of population) Percentage of female population, age above 15, that has been employed during the given year
Polity Score: 2009 Democracy score (Polity) Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. The summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest
Income per Person: 2010 Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account
Urban Population: 2008 urban population (% of total) Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects)
Internet Use Rate: 2010 Internet users (per 100 people) Internet users are people with access to the worldwide network.

References

Gapminder Code book
Gapminder

Wednesday, November 25, 2015

Exploring Statistical Interactions

The data set in question is Gapminder. I am looking at the relationship between urban population rate and internet usage. I am interested in seeing whether the polity score of a country moderates the relationship between urban rate and internet usage.

Code



# -*- coding: utf-8 -*-
"""
Created on Sun Nov 15 18:56:59 2015

@author: Abhishek
"""
import numpy
import pandas
import statsmodels.formula.api as smf 
import statsmodels.stats.multicomp as multi
import seaborn
import matplotlib.pyplot as plt
import scipy.stats

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f'%x)

data['urbanrate'] = pandas.to_numeric(data['urbanrate'],errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'],errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

# include_lowest=True keeps a polity score of exactly -10 in the autocracy bin
data['polity_cat'] = pandas.cut(data.polityscore, [-10, -6, 5, 10], labels=['autocracy', 'anocracy', 'democracy'], include_lowest=True)
data['urbanrate_cat'] = pandas.cut(data.urbanrate, [10,40,70,100], labels=['sparse', 'moderate','dense'])

# convert the categoricals to plain objects (the numpy.object alias was removed in NumPy 1.24)
data['polity_cat'] =data['polity_cat'].astype(object)
data['polity_cat'] = data['polity_cat'].replace(' ',numpy.nan)

data['urbanrate_cat'] =data['urbanrate_cat'].astype(object)
data['urbanrate_cat'] = data['urbanrate_cat'].replace(' ',numpy.nan)

data['internetuserate_cat'] = pandas.cut(data.internetuserate, 3, labels=['low', 'moderate','high'])
data['internetuserate_cat'] =data['internetuserate_cat'].astype(object)
data['internetuserate_cat'] = data['internetuserate_cat'].replace(' ',numpy.nan)

sub2=data[(data['polity_cat']=='autocracy')]
sub3=data[(data['polity_cat']=='anocracy')]
sub4=data[(data['polity_cat']=='democracy')]

#%%

# ANOVA

model1 = smf.ols(formula='internetuserate ~ C(urbanrate_cat)', data=data).fit()
print (model1.summary())

print("Mean internet use rate by urban rate category")
sub1 = data[['internetuserate', 'urbanrate_cat']].dropna()
m1 = sub1.groupby('urbanrate_cat').mean()
print(m1)

print("Standard deviation of internet use rate by urban rate category")
st1 = sub1.groupby('urbanrate_cat').std()
print(st1)

# bivariate bar graph (factorplot was renamed catplot in seaborn 0.9)
seaborn.catplot(x="urbanrate_cat", y="internetuserate", data=data, kind="bar", ci=None)
plt.xlabel('Urban Population Rate')
plt.ylabel('Internet Usage Rate')


#%%

print("==========================================================================")
print("==========================================================================")
print()
print()

print ('association between urbanrate and internet usage for autocratic countries')
model2 = smf.ols(formula='internetuserate ~ C(urbanrate_cat)', data=sub2).fit()
print (model2.summary())

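print("Standard deviation of internet use rate by urban rate category (autocratic countries)")
st2 = sub2[['internetuserate', 'urbanrate_cat']].dropna().groupby('urbanrate_cat').std()
print(st2)

# bivariate bar graph (factorplot was renamed catplot in seaborn 0.9)
seaborn.catplot(x="urbanrate_cat", y="internetuserate", data=sub2, kind="bar", ci=None)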
plt.xlabel('Urban Population Rate')
plt.ylabel('Internet Usage Rate')

print ('association between urbanrate and internet usage for anocratic countries')
model3 = smf.ols(formula='internetuserate ~ C(urbanrate_cat)', data=sub3).fit()
print (model3.summary())

# bivariate bar graph (factorplot was renamed catplot in seaborn 0.9)
seaborn.catplot(x="urbanrate_cat", y="internetuserate", data=sub3, kind="bar", ci=None)
plt.xlabel('Urban Population Rate')
plt.ylabel('Internet Usage Rate')

print ('association between urbanrate and internet usage for democratic countries')
model4 = smf.ols(formula='internetuserate ~ C(urbanrate_cat)', data=sub4).fit()
print (model4.summary())

# bivariate bar graph (factorplot was renamed catplot in seaborn 0.9)
seaborn.catplot(x="urbanrate_cat", y="internetuserate", data=sub4, kind="bar", ci=None)
plt.xlabel('Urban Population Rate')
plt.ylabel('Internet Usage Rate')

print("==========================================================================")
print("==========================================================================")


#%%

# Chi Square Test of independence

print("Chi Square Test of Independence: Internet usage rate vs urban population rate")
print("-----------------------------------------------------------------------------")

ct=pandas.crosstab(data['internetuserate_cat'], data['urbanrate_cat'])

# column sum
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print()

print("Chi Square Test of Independence: Test for moderation for Autocratic countries")
print("-----------------------------------------------------------------------------")

ct=pandas.crosstab(sub2['internetuserate_cat'], sub2['urbanrate_cat'])
#%%

# column sum
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print()

print("Chi Square Test of Independence: Test for moderation for Anocratic countries")
print("-----------------------------------------------------------------------------")

ct=pandas.crosstab(sub3['internetuserate_cat'], sub3['urbanrate_cat'])
#%%

# column sum
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print()

print("Chi Square Test of Independence: Test for moderation for Democratic countries")
print("-----------------------------------------------------------------------------")

ct=pandas.crosstab(sub4['internetuserate_cat'], sub4['urbanrate_cat'])
#%%

# column sum
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print()

#%%


# Pearson Correlation

data['internetuserate'] = data['internetuserate'].replace(' ',numpy.nan)
data['urbanrate'] = data['urbanrate'].replace(' ',numpy.nan)
data['polityscore'] = data['polityscore'].replace(' ',numpy.nan)

data_clean=data.dropna()

print ('Pearson Correlation: urbanrate and internetusage')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['internetuserate']))

def polityscoregrp (row):
   if row['polityscore'] <= -5:
      return 1
   elif row['polityscore'] <= 5 :
      return 2
   elif row['polityscore'] <= 10:   # <= so that the maximum score of 10 lands in group 3
      return 3
   
data_clean['polityscore_grp'] = data_clean.apply (lambda row: polityscoregrp (row),axis=1)

chk1 = data_clean['polityscore_grp'].value_counts(sort=False, dropna=False)
print(chk1)

sub2=data_clean[(data_clean['polityscore_grp']== 1)]
sub3=data_clean[(data_clean['polityscore_grp']== 2)]
sub4=data_clean[(data_clean['polityscore_grp']== 3)]

sub2_clean = sub2.dropna()
sub3_clean = sub3.dropna()
sub4_clean = sub4.dropna()

print('--------------------------------------------------')
print ('Test of moderation by polityscore, Pearson Correlation: urbanrate and internetusage')
print (scipy.stats.pearsonr(sub2_clean['urbanrate'], sub2_clean['internetuserate']))
print (scipy.stats.pearsonr(sub3_clean['urbanrate'], sub3_clean['internetuserate']))
print (scipy.stats.pearsonr(sub4_clean['urbanrate'], sub4_clean['internetuserate']))

#%%
scat2 = seaborn.regplot(x="internetuserate", y="urbanrate", data=sub2_clean)
plt.xlabel('Internet Use Rate')
plt.ylabel('Urban Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Autocratic countries')
print (scat2)

#%%
scat3 = seaborn.regplot(x="internetuserate", y="urbanrate", data=sub3_clean)
plt.xlabel('Internet Use Rate')
plt.ylabel('Urban Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Anocratic countries')
print (scat3)

#%%
scat4 = seaborn.regplot(x="internetuserate", y="urbanrate", data=sub4_clean)
plt.xlabel('Internet Use Rate')
plt.ylabel('Urban Rate')
plt.title('Scatterplot for the Association Between Urban Rate and Internet Use Rate for Democratic countries')
print (scat4)

Results

ANOVA

To conduct ANOVA, the urbanrate variable has been categorized into sparse, moderate and dense, signifying a low, moderate and high share of the population living in urban areas.

The large F-statistic and very small p value show that the null hypothesis can be rejected in this case: there is a relationship between urban population rate and internet usage. The means show that internet usage increases with urban population rate, and the bar graph shows the same pattern.
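To make the F-statistic and p value mentioned above easier to locate, here is a minimal sketch that fits the same kind of model on made-up data. The `toy` data frame and its numbers are invented for illustration and are not the Gapminder results; `fvalue` and `f_pvalue` are the attributes of the fitted statsmodels model that hold the two numbers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Made-up data: three urban rate categories with different mean internet use
toy = pd.DataFrame({
    "urbanrate_cat": np.repeat(["sparse", "moderate", "dense"], 30),
    "internetuserate": np.concatenate([
        rng.normal(15, 10, 30),
        rng.normal(35, 10, 30),
        rng.normal(60, 10, 30),
    ]),
})

# Same formula style as the script above: one categorical explanatory variable
model = smf.ols("internetuserate ~ C(urbanrate_cat)", data=toy).fit()
print("F statistic:", model.fvalue)
print("p value:", model.f_pvalue)

# Group means, to see the direction of the differences
print(toy.groupby("urbanrate_cat")["internetuserate"].mean())
```

The group means are what show the direction of the differences; the F-statistic alone only says that at least one group differs.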

ANOVA Test of Moderation by Polity Score

From the above ANOVA results we see that when the polity score of a country is between -10 and -6, that is, when the country is autocratic, polity score does moderate the relationship between urban population rate and internet usage.

Chi Square Test of Independence

For the Chi Square Test of Independence, the internet usage rate is categorized as low, moderate and high. The test is run on this categorical internet usage variable against urban population rate with categories sparse, moderate and dense. The results are as follows.

The test without a moderating variable shows a large chi square value and a very small p value, as expected. The null hypothesis can therefore be rejected, and we conclude that there is a relationship between the two categorical variables.
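The script prints the result of `scipy.stats.chi2_contingency` as one tuple, which can be hard to read. A small sketch of unpacking it into named values follows. The counts in `observed` are made up for illustration and are not the Gapminder crosstab.

```python
import numpy as np
import scipy.stats

# Made-up 3x3 counts, NOT the Gapminder results.
# Rows: low / moderate / high internet use
# Columns: sparse / moderate / dense urban rate
observed = np.array([[30, 20, 5],
                     [10, 25, 15],
                     [2, 10, 30]])

# chi2_contingency returns the statistic, the p value,
# the degrees of freedom and the table of expected counts
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

print("chi square value:", chi2)
print("p value:", p_value)
print("degrees of freedom:", dof)
print("expected counts:")
print(expected)
```

For a table with 3 rows and 3 columns the degrees of freedom are (3 - 1) * (3 - 1) = 4.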

Chi Square Test of Moderation by Polity Score

The Chi Square Test with polity score used as the moderating variable shows that for autocratic countries the null hypothesis cannot be rejected. Hence polity score moderates the relationship when it falls in the autocratic range. These results are in keeping with the ANOVA results.

Pearson Correlation

This test looks at the correlation between urban population rate and internet usage. The polity score, which is the moderating variable, is split into three groups to ease the testing of moderation. The groups are numbered 1, 2 and 3: group 1 covers scores from -10 to -5, group 2 covers scores above -5 up to 5, and group 3 covers scores above 5 up to 10. A country falling in group 3 is democratic.

The p value as shown above is quite small, and hence we conclude that there is a relationship between internet usage and urban population rate. The value counts table shows the number of countries in each of groups 1, 2 and 3.

Correlation moderated by Polity Score

The results show that for autocratic countries, that is, countries that fall in group 1, polity score does moderate the correlation between urban population rate and internet usage.

In conclusion, polity score moderates the results of the ANOVA, the Chi Square Test of Independence and the correlation between urban population rate and internet usage when the country is autocratic.
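The idea behind the test of moderation, running the same test separately within each level of the third variable, can be sketched compactly with a loop. The data below is synthetic and only stands in for the Gapminder columns; the column name `polity_grp` and the numbers are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic stand-in data, NOT the Gapminder values
n = 90
toy = pd.DataFrame({
    "urbanrate": rng.uniform(10, 100, n),
    "polity_grp": np.repeat([1, 2, 3], n // 3),
})
toy["internetuserate"] = 0.5 * toy["urbanrate"] + rng.normal(0, 15, n)

# Run the same Pearson correlation inside each polity group
results = {}
for grp, rows in toy.groupby("polity_grp"):
    r, p = scipy.stats.pearsonr(rows["urbanrate"], rows["internetuserate"])
    results[grp] = (r, p)
    print(f"group {grp}: r = {r:.3f}, p = {p:.4f}")
```

If the correlation is significant in some groups but not in others, or changes direction, the grouping variable is a candidate moderator.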

Saturday, November 14, 2015

Calculating Correlation

I have been using the Gapminder data set for my research questions. Unlike my previous submissions, where I looked at female employment rate, here I am looking at the correlation between life expectancy and urban rate.

Code


# -*- coding: utf-8 -*-
"""
Created on Thu Nov 12 14:06:59 2015
@author: Abhishek
"""
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f'%x)

# convert_objects has been removed from pandas; to_numeric with errors='coerce' does the same job
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
'''
scat1 = seaborn.regplot(x='urbanrate', y='lifeexpectancy', fit_reg=True,data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot Urban Rate VS Life Expectancy')
'''
data['lifeexpectancy'] = data['lifeexpectancy'].replace(' ',numpy.nan)
data['urbanrate'] = data['urbanrate'].replace(' ',numpy.nan)

data_clean=data.dropna()

print ('association between urbanrate and lifeexpectancy')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

Results

The p value is significant and the correlation coefficient is 0.61870.

The scatter plot above shows that there is a positive linear relationship between lifeexpectancy and urbanrate. As shown by the correlation coefficient, the relationship is of moderate strength.

The relationship is statistically significant. It is likely that countries with a higher urban population rate have higher life expectancy.

Squaring r gives us 0.38. This means that if we know the x variable in the scatter plot, in this case urbanrate, then we can predict 38% of the variability in life expectancy. The remaining 62% is unaccounted for.
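The arithmetic behind the 38% and 62% figures is short enough to check directly, using the correlation coefficient reported above:

```python
r = 0.61870  # correlation coefficient reported in the results

r_squared = r ** 2           # share of variability accounted for
unexplained = 1 - r_squared  # share left unaccounted for

print(f"r squared: {r_squared:.2f}")
print(f"unaccounted for: {unexplained:.2f}")
```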

Thursday, November 12, 2015

Chi Square Test of Independence

For the Chi Square test of independence I will use the same research question as in my last post: whether female employment rate is associated with the polity score of a country. The NULL hypothesis is that there is no association between female employment rate and polity score. In the GAPMINDER data set both these fields are numeric, so in order to do a Chi Square test I had to categorize them.

Polity Score Categories -

Autocracies: -10 to -6

Anocracies: -5 to 5

Democracies: 6 to 10

The categories are obtained from the definition of Polity score in Wikipedia.
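A quick way to check that the bin edges reproduce these three categories is to run `pandas.cut` on a few boundary scores. This is a sketch on hand-picked values, not on the Gapminder data. Note that with the default settings the lowest edge is excluded, so a score of exactly -10 would be left uncategorized unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hand-picked boundary values for illustration
scores = pd.Series([-10, -6, -5, 0, 5, 6, 10])
bins = [-10, -6, 5, 10]
labels = ['autocracy', 'anocracy', 'democracy']

# include_lowest=True puts -10 into the first bin
cats = pd.cut(scores, bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"polityscore": scores, "polity_cat": cats}))

# With the defaults the first interval is open on the left, so -10 becomes NaN
default_cats = pd.cut(scores, bins, labels=labels)
print("is -10 missing with the default settings?", pd.isna(default_cats.iloc[0]))
```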

Code


# -*- coding: utf-8 -*-
"""
Created on Sun Nov  8 13:54:31 2015

@author: Abhishek
"""
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f'%x)

data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
# include_lowest=True keeps a polity score of exactly -10 in the autocracy bin
data['polity_cat'] = pandas.cut(data.polityscore, [-10, -6, 5, 10], labels=['autocracy', 'anocracy', 'democracy'], include_lowest=True)

data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['femaleemployrate_cat'] = pandas.cut(data.femaleemployrate, [0, 25, 75, 100], labels=['low', 'medium', 'high'])

ct=pandas.crosstab(data['femaleemployrate_cat'], data['polity_cat'])
print(ct)
print('--------------------------------------------------------------')

# column sum
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

# Chi Square
print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)

data["polity_cat"] =data["polity_cat"].astype('category')

# draw the bar chart first, then label its axes (factorplot was renamed catplot in seaborn 0.9)
seaborn.catplot(x="polity_cat",y="femaleemployrate",data=data,kind="bar",ci=None)
plt.xlabel('Polity Score')
plt.ylabel('Female Employment Rate')

print()
print()

# Post Hoc Test
print('---------------------------------------------------------------')
print('                      Post Hoc Test                            ')
print('---------------------------------------------------------------')

recode = {'autocracy': 'autocracy', 'anocracy': 'anocracy' }
data['comp1'] = data['polity_cat'].map(recode)
ct = pandas.crosstab(data['femaleemployrate_cat'], data['comp1'])
print(ct)
print('--------------------------------------------------------------')
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')

print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print('--------------------------------------------------------------')
print('--------------------------------------------------------------')

recode = {'autocracy': 'autocracy', 'democracy':'democracy' }
data['comp2'] = data['polity_cat'].map(recode)
ct = pandas.crosstab(data['femaleemployrate_cat'], data['comp2'])
print(ct)

print('--------------------------------------------------------------')
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')
print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)
print('--------------------------------------------------------------')
print('--------------------------------------------------------------')

recode = {'anocracy': 'anocracy', 'democracy':'democracy' }
data['comp3'] = data['polity_cat'].map(recode)
ct = pandas.crosstab(data['femaleemployrate_cat'], data['comp3'])
print(ct)

print('--------------------------------------------------------------')
colsum = ct.sum(axis=0)
colpct = ct/colsum
print(colpct*100)
print('--------------------------------------------------------------')
print('chi square value, p value, expected count')
cs = scipy.stats.chi2_contingency(ct)
print(cs)

Initial result of Chi Square test

This result shows the crosstab of female employment rate, categorized as High, Medium and Low, against polity score, categorized as autocracy, anocracy and democracy. We are looking at column percentages. Our initial results show that in the democracy column the female employment rate is significantly higher in the row labeled ‘Medium’. The p value of 0.0003271499 is small enough to reject the NULL hypothesis. However, the explanatory variable polity score has more than two categories (autocracy, anocracy and democracy), so the chi square value and p value alone do not tell us which groups differ from each other. A post hoc test is required. Since there are 3 categories of the explanatory variable, we are going to make 3 comparisons. The Bonferroni adjusted p value is 0.017 (0.05/3).

Results of Post Hoc test

Since there are 3 categories of the explanatory variable, 3 comparisons are made.

Anocracy VS Autocracy: The Chi Square Test value is 1.685 and p Value is 0.43, which is greater than 0.017. Hence we cannot reject the NULL hypothesis for this comparison.

Autocracy VS Democracy: The Chi Square Test value is 10.64 and p Value is 0.004885. The p Value here is less than 0.017 hence we can reject the NULL hypothesis and accept the alternative hypothesis. This means compared to autocratic countries, democratic countries have a higher female employment rate.

Anocracy VS Democracy: The Chi Square Test value is 17.52139 and p Value is 0.000157. The p Value here again is well below 0.017 and hence we can safely reject the NULL hypothesis.
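The three comparisons above can be checked against the Bonferroni threshold in a few lines. The p values are the ones reported in this post; the dictionary keys are just labels chosen for this sketch.

```python
alpha = 0.05
n_comparisons = 3
adjusted_alpha = alpha / n_comparisons  # Bonferroni adjustment
print("adjusted threshold:", round(adjusted_alpha, 3))

# p values reported for the three pairwise comparisons
p_values = {
    "anocracy vs autocracy": 0.43,
    "autocracy vs democracy": 0.004885,
    "anocracy vs democracy": 0.000157,
}

# A comparison is significant only if its p value is below the adjusted threshold
significant = {pair: p < adjusted_alpha for pair, p in p_values.items()}
for pair, is_sig in significant.items():
    print(pair, "->", "reject NULL" if is_sig else "cannot reject NULL")
```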

Conclusion

Chi Square test shows that polity score is in fact related to female employment rate. Democratic countries have a significantly higher female employment rate than Anocratic countries.