Sunday, February 7, 2016

Decision Trees

Dataset

I am using the Gapminder dataset. A decision tree analysis was performed to test nonlinear relationships between a series of explanatory variables and a binary categorical response variable.

Data Management

The Gapminder dataset does not have any binary categorical variable, so I have converted the variable ‘polityscore’ into one. The polity score ranges from -10 to 10: scores from -10 to -5 are considered Autocratic, scores between -5 and 5 are considered Anocratic (countries in this range are usually politically neutral), and scores from 5 to 10 are considered Democratic. The variable is recoded such that if a country is Autocratic or Democratic (a score at or below -5, or at or above 5), the value is 1; otherwise the value is 0.
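For clarity, a minimal sketch of this recoding rule is shown below (the full version, applied row-wise to the cleaned dataset, appears in the code section at the end of this post); scores at or beyond the -5/+5 cutoffs are treated as politically biased.

def recode_polity(score):
    # Autocratic (<= -5) or Democratic (>= 5) countries are coded 1,
    # Anocratic (politically neutral) countries in between are coded 0.
    if score <= -5 or score >= 5:
        return 1
    return 0

print(recode_polity(-7), recode_polity(0), recode_polity(9))  # prints: 1 0 1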

The Decision Tree

The decision tree above is constructed with polity score as the response variable (target) and 3 quantitative explanatory variables (predictors): internet use rate, female employment rate and urban population rate. Reading the tree from top to bottom, the first split (the root node) is based on X[0] <= 38.05: if internet use rate (X[0], the first explanatory variable) is less than or equal to 38.05%, move to the left branch; otherwise go right. At the second level the split is based on the second explanatory variable (X[1]), female employment rate. In this fashion the tree grows by recursively splitting on conditions imposed on the explanatory variables. One can follow the tree in a depth-first fashion and reach one of the leaves, which shows the counts of the response variable classes resulting from the decisions made along the branches.
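For readers who prefer to inspect these split conditions programmatically rather than from the image, scikit-learn exposes them through the fitted classifier's tree_ attribute. The sketch below assumes the classifier and predictors variables defined in the code section at the end of this post.

# Print the feature name and threshold used at each internal (non-leaf) node.
feature_names = list(predictors.columns)   # internetuserate, femaleemployrate, urbanrate
tree_ = classifier.tree_
for node in range(tree_.node_count):
    if tree_.children_left[node] != -1:    # children_left == -1 marks a leaf
        print("node %d: %s <= %.2f" % (
            node, feature_names[tree_.feature[node]], tree_.threshold[node]))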
 Shape of the training and test samples

The above output shows the shapes of the training and test samples. 87 observations, or 60% of the dataset, are used as the training set; the remaining 40%, or 59 observations, are used as the test set. There are 3 predictors (explanatory variables) used in this example.
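As a quick sanity check on these numbers (an illustration of the arithmetic only, not part of the original script): with 146 complete observations remaining after dropping missing values and test_size=.4, scikit-learn rounds the test set size up, which yields the 87/59 split reported above.

import math

n_total = 87 + 59                    # complete observations after dropna()
n_test = math.ceil(n_total * 0.4)    # test_size=.4 rounds up to 59 observations
n_train = n_total - n_test           # the remaining 87 observations form the training set
print(n_total, n_train, n_test)      # prints: 146 87 59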

Correct and Incorrect classification of Decision Tree

The diagonal values, 11 and 35, reflect the number of true negatives and true positives respectively. The off-diagonal values, 5 and 8, are the misclassifications: the false positives and false negatives for the recoded polity score variable, where a value of 1 means the country is politically biased (autocratic or democratic) and 0 means the country is politically neutral (anocratic).

The accuracy of the model came out to 0.779661016949, i.e. 46 of the 59 test observations were classified correctly.
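To make the link between the confusion matrix and the accuracy explicit, the sketch below unpacks the four counts reported above (using scikit-learn's convention of rows as true labels and columns as predicted labels; the exact placement of the 5 and 8 is assumed from that layout) and recomputes the accuracy.

import numpy as np

# Confusion matrix built from the counts reported above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[11, 5],
               [8, 35]])
tn, fp, fn, tp = cm.ravel()        # 11 true negatives, 5 FP, 8 FN, 35 true positives
accuracy = (tn + tp) / cm.sum()    # (11 + 35) / 59
print(accuracy)                    # prints: 0.7796610169491526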

Code


# -*- coding: utf-8 -*-
"""
Created on Sat Feb  6 23:33:14 2016

@author: Abhishek
"""
from pandas import Series, DataFrame
import pandas
import numpy as np
import os
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

#os.chdir("C:\Data Analysis and Interpretation")

"""
Data Engineering and Analysis
"""
#Load the dataset

data = pandas.read_csv('gapminder.csv')

# Replace empty values with NaN (np.object was removed in newer numpy,
# so plain object is used for the intermediate string dtype)
data['polityscore'] = data['polityscore'].astype(object)
data['polityscore'] = data['polityscore'].replace(' ',np.nan)
data['polityscore'] = data['polityscore'].replace('',np.nan)

data['femaleemployrate'] = data['femaleemployrate'].astype(object)
data['femaleemployrate'] = data['femaleemployrate'].replace(' ',np.nan)
data['femaleemployrate'] = data['femaleemployrate'].replace('',np.nan)

data['internetuserate'] = data['internetuserate'].astype(object)
data['internetuserate'] = data['internetuserate'].replace(' ',np.nan)
data['internetuserate'] = data['internetuserate'].replace('',np.nan)

data['urbanrate'] = data['urbanrate'].astype(object)
data['urbanrate'] = data['urbanrate'].replace(' ',np.nan)
data['urbanrate'] = data['urbanrate'].replace('',np.nan)

# Target Variable
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

# Predictor Variables
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')

data_clean = data.dropna()

#print(data_clean.dtypes)
#print(data_clean.describe())

#%%
# Data Management: Polity Score is chosen as the response variable.
# -10 to -5: Autocratic, -5 to 5: Anocratic and 5 to 10: Democratic
# Here, Anocratic countries are coded as 0 and countries with political
# biases are coded as 1. Hence we have our binary response variable

def RecodePolityScore (row):
   if row['polityscore']<=-5 or row['polityscore']>=5 :
      return 1
   elif row['polityscore']>-5 and row['polityscore']<5 :
      return 0
       

# Apply the recoding and check that it worked
data_clean['polityscore'] = data_clean.apply(lambda row: RecodePolityScore(row), axis=1)
chk1d = data_clean['polityscore'].value_counts(sort=False, dropna=True)
print (chk1d)

#%%
"""
Modeling and Prediction
"""
#Split into training and testing sets

predictors = data_clean[[
'internetuserate',
'femaleemployrate',
'urbanrate'
]]

targets = data_clean.polityscore

pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)
print("Shape")
print("-----")
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

print("Predictions")
print("-----------")
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print("Accuracy of the Model")
print("---------------------")
print(sklearn.metrics.accuracy_score(tar_test, predictions))

#Displaying the decision tree
from sklearn import tree
from io import StringIO

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydot_ng as pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
with open('DecisionTree.png', 'wb') as f:
    f.write(graph.create_png())
    
