Sunday, December 20, 2015

Test Multiple Regression

I am using Gapminder data set. My hypothesis is Life Expectancy is associated with Urban population rate in a positive way. Life expectancy is the response variable and Urban Population rate is the explanatory variable. I will be testing if any confounding exists with variables – income per person, employment rate, co2 emission and alcohol consumption

Code


# -*- coding: utf-8 -*-
"""
Created on Sat Dec 12 14:11:38 2015

@author: Abhishek
"""

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)

data = pandas.read_csv('gapminder.csv')

# convert to numeric format

# Response Variable
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# Explanatory Variables
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')

# listwise deletion of missing values
sub1 = data[['lifeexpectancy', 'urbanrate','incomeperperson','employrate','co2emissions','alcconsumption']].dropna()

# first order (linear) scatterplot
scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", scatter=True, data=sub1)
plt.xlabel('Urban Population Rate')
plt.ylabel('Life Expectancy')

# fit second order polynomial
# run the 2 scatterplots together to get both linear and second order fit lines
scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", scatter=True, order=2, data=sub1)
plt.xlabel('Urban Population Rate')
plt.ylabel('Life Expectancy')

# center quantitative IVs for regression analysis
sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean())
sub1['incomeperperson_c'] = (sub1['incomeperperson'] - sub1['incomeperperson'].mean())
sub1['employrate_c'] = (sub1['employrate'] - sub1['employrate'].mean())
sub1['co2emissions_c'] = (sub1['co2emissions'] - sub1['co2emissions'].mean())
sub1['alcconsumption_c'] = (sub1['alcconsumption'] - sub1['alcconsumption'].mean())

# linear regression analysis
reg1 = smf.ols('lifeexpectancy ~ urbanrate_c', data=sub1).fit()
print (reg1.summary())

reg2 = smf.ols('lifeexpectancy ~ urbanrate_c  + incomeperperson_c + employrate_c + co2emissions_c + alcconsumption_c ', data=sub1).fit()
print (reg2.summary())

reg3 = smf.ols('lifeexpectancy ~ urbanrate_c + I(urbanrate_c**2) + employrate_c', data=sub1).fit()
print (reg3.summary())

#Q-Q plot for normality
fig4=sm.qqplot(reg1.resid, line='r')


# simple plot of residuals
stdres=pandas.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')


# additional regression diagnostic plots
fig2 = plt.figure(figsize(10,5))
fig2 = sm.graphics.plot_regress_exog(reg3,  "employrate_c", fig=fig2)

# leverage plot
fig3=sm.graphics.influence_plot(reg3, size=8)
print(fig3)

Results



From the p value and coefficient it seems like there is strong linear relationship between life expectancy and urban population rate. A quadratic or higher order variable is not required in the regression test. The code above does show the second variable added to the OLS regression function for academic purpose only. Further test with other explanatory variable are conducted to test for confounding.



 R squared value: We did not find any confounding variables but controlling other variables in the OLS regression does increase the R square value. Which means, with more variables in the mix the variability in association between Life Expectancy and Urban Population rate is less

The q-q plot somewhat follows a straight line but deviates considerably at lower and higher quantiles. There could be other explanatory variables that might strengthen the association between life expectancy and urban population rate.
Most of the residuals in the above graph fall between one standard deviation of the mean, between -1 and 1. Visually we can ascertain that about 95% of the residuals falls between 2 standard deviation of the mean. Since there are residual values beyond 2 standard deviation of mean on either side, it is warning that there are outliers and model is not a great fit.



No comments:

Post a Comment