Sunday, February 21, 2016

Lasso Regression

Data Set
I am using Gapminder dataset for the purpose of running Random Forest experiment

Data Management
Similar to previous assignment, I am using binary categorical response variable, ‘polityscore’ and a number of quantitative explanatory variables. The variable polity score ranges from -10 to 10. -10 to -5 is considered Autocratic. -5 to 5 is considered Anocratic. Countries falling in this range is usually politically neutral. 5 to 10 is Democratic. The variable is re-coded such that if a country is Autocratic or Democratic, then the value is 1. Otherwise the value is 0. In LR penalty term is not fair if predictor variables are not on the same scale. That means all the predictors get the same penalty. To remedy this, I will standardize all predictors to have mean = 0 and SD = 1.

Lasso Regression


Variable names and regression coefficients
From the list above we can determine which predictors have shrunk to 0 after applying penalty and subsequently removed from the model. We see that URBANRATE has shrunk to 0. FEMALEEMPLOYRATE and INTERNETUSERATE have strong positive correlation with with target variable and EMPLOYRATE and ARMEDFORCERATE have strong negative correlation with target variable.
This plot shows the relative importance of the predictors at any step of the selection process and they were entered in the model.Comparing against the variable name and regression coefficient list, we see that ARMEDFORCERATE and then LIFEEXPECTANCY enter the model in the beginning. Further down the road, EMPLOYRATE and FEMALEEMPLOYRATE is entered in the model. I assume the reason why armedforcerate and lifeexpectancy predictors were entered first even though they do not have the largest regression coefficient is because initially they had a sharp coefficient slope.
The above plot shows the change in the Mean Square Error vs change in the penalty parameter alpha at each step in the selection process. The average mean square error across all cross validation fold is plotted as the thick black line.
As expected the selected model was less accurate than the test data. Test mean square error was close to training mean square error. This suggests that prediction accuracy was stable across the 2 data sets. The R square values were 0.31 and 0.44, indicating that the training and test data set explained 33 and 44 percent of variance in the polity score bias of a country in the training and test sets respectively.

Code


# -*- coding: utf-8 -*-
"""
Created on Sun Feb 21 11:52:01 2016

@author: Abhishek
"""

#from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
 
#Load the dataset
data = pd.read_csv('gapminder.csv')


# Target Variable
data['polityscore'] = pd.to_numeric(data['polityscore'], errors='coerce')

# Predictor Variables
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce')
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['armedforcesrate'] = pd.to_numeric(data['armedforcesrate'], errors='coerce')
data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce')
data['suicideper100th'] = pd.to_numeric(data['suicideper100th'], errors='coerce')

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

# Data Management
data_clean = data.dropna()

# Function that converts Polity score to a binary variable.
def RecodePolityScore (row):
   if row['polityscore']<=-5 or row['polityscore']>=5 :
      return 1
   elif row['polityscore']>-5 and row['polityscore']<5 :
      return 0

#select predictor variables and target variable as separate data sets  
predvar = data_clean[[
'INCOMEPERPERSON',
'INTERNETUSERATE',
'FEMALEEMPLOYRATE',
'ARMEDFORCESRATE',
'EMPLOYRATE',
'LIFEEXPECTANCY',
'URBANRATE',
'SUICIDEPER100TH'
]]

target = data_clean.POLITYSCORE
 
# standardize predictors to have mean=0 and sd=1
predictors=predvar.copy()
from sklearn import preprocessing
predictors['INCOMEPERPERSON']=preprocessing.scale(predictors['INCOMEPERPERSON'].astype('float64'))
predictors['INTERNETUSERATE']=preprocessing.scale(predictors['INTERNETUSERATE'].astype('float64'))
predictors['FEMALEEMPLOYRATE']=preprocessing.scale(predictors['FEMALEEMPLOYRATE'].astype('float64'))
predictors['ARMEDFORCESRATE']=preprocessing.scale(predictors['ARMEDFORCESRATE'].astype('float64'))
predictors['EMPLOYRATE']=preprocessing.scale(predictors['EMPLOYRATE'].astype('float64'))
predictors['LIFEEXPECTANCY']=preprocessing.scale(predictors['LIFEEXPECTANCY'].astype('float64'))
predictors['URBANRATE']=preprocessing.scale(predictors['URBANRATE'].astype('float64'))
predictors['SUICIDEPER100TH']=preprocessing.scale(predictors['SUICIDEPER100TH'].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, 
                                                              test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

# print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')


# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
         

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
 

No comments:

Post a Comment