Code
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 12 14:11:38 2015
@author: Abhishek
"""
import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)
data = pandas.read_csv('gapminder.csv')
# convert to numeric format
# Response Variable
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
# Explanatory Variables
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
# listwise deletion of missing values
sub1 = data[['lifeexpectancy', 'urbanrate','incomeperperson','employrate','co2emissions','alcconsumption']].dropna()
# first order (linear) scatterplot
scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", scatter=True, data=sub1)
plt.xlabel('Urban Population Rate')
plt.ylabel('Life Expectancy')
# fit second order polynomial
# run the 2 scatterplots together to get both linear and second order fit lines
scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", scatter=True, order=2, data=sub1)
plt.xlabel('Urban Population Rate')
plt.ylabel('Life Expectancy')
# center quantitative IVs for regression analysis
sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['incomeperperson_c'] = (sub1['incomeperperson'] - sub1['incomeperperson'].mean())
sub1['employrate_c'] = (sub1['employrate'] - sub1['employrate'].mean())
sub1['co2emissions_c'] = (sub1['co2emissions'] - sub1['co2emissions'].mean())
sub1['alcconsumption_c'] = (sub1['alcconsumption'] - sub1['alcconsumption'].mean())
# linear regression analysis
reg1 = smf.ols('lifeexpectancy ~ urbanrate_c', data=sub1).fit()
print (reg1.summary())
reg2 = smf.ols('lifeexpectancy ~ urbanrate_c + incomeperperson_c + employrate_c + co2emissions_c + alcconsumption_c ', data=sub1).fit()
print (reg2.summary())
reg3 = smf.ols('lifeexpectancy ~ urbanrate_c + I(urbanrate_c**2) + employrate_c', data=sub1).fit()
print (reg3.summary())
#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')
# simple plot of residuals
stdres=pandas.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
# additional regression diagnostic plots
fig2 = plt.figure(figsize(10,5))
fig2 = sm.graphics.plot_regress_exog(reg3, "employrate_c", fig=fig2)
# leverage plot
fig3=sm.graphics.influence_plot(reg3, size=8)
print(fig3)
Results
From the p value and coefficient it seems like there
is strong linear relationship between life expectancy and urban population
rate. A quadratic or higher order variable is not required in the regression
test. The code above does show the second variable added to the OLS regression
function for academic purpose only. Further test with other explanatory
variable are conducted to test for confounding.
The q-q plot somewhat follows a straight line but deviates considerably at lower and higher quantiles. There could be other explanatory variables that might strengthen the association between life expectancy and urban population rate.
Most of the residuals in the above graph fall between one
standard deviation of the mean, between -1 and 1. Visually we can ascertain that
about 95% of the residuals falls between 2 standard deviation of the mean.
Since there are residual values beyond 2 standard deviation of mean on either
side, it is warning that there are outliers and model is not a great fit.
No comments:
Post a Comment