I am using the Gapminder dataset. I performed a decision tree analysis to test for nonlinear relationships between a set of explanatory variables and a binary categorical response variable.
Data Management
The Gapminder dataset does not have any binary categorical variable, so I converted the variable ‘polityscore’ into one. The polity score ranges from -10 to 10. Scores of -5 and below are considered Autocratic, scores strictly between -5 and 5 are considered Anocratic, and scores of 5 and above are considered Democratic. Countries falling in the Anocratic range are usually politically neutral. The variable is recoded such that if a country is Autocratic or Democratic, the value is 1. Otherwise the value is 0.
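The recoding rule can be sketched as a small standalone function. This is only an illustration of the boundaries described above; `recode_polity` is a hypothetical helper name, not part of the script in the Code section.

```python
def recode_polity(score):
    """Return 1 for Autocratic (<= -5) or Democratic (>= 5) scores, else 0."""
    return 1 if score <= -5 or score >= 5 else 0

# Check the boundary values: -5 and 5 count as 1, everything in between as 0
for score in (-10, -5, -4, 0, 4, 5, 10):
    print(score, recode_polity(score))
```

Note that the boundary scores -5 and 5 are treated as Autocratic and Democratic respectively, so only scores from -4 to 4 end up coded as 0.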
The Decision Tree
The decision tree is constructed with polity score as the response variable (target) and 3 quantitative explanatory variables (predictors): internet use rate, female employment rate and urban population rate. Reading the tree from top to bottom, the first split (the root node) is based on X[0] <= 38.05. If internet use rate (X[0], the first explanatory variable) is at most 38.05%, move to the left branch. Otherwise go right. At the second level the split is based on the second explanatory variable (X[1]), female employment rate. In this fashion the tree grows by recursively splitting on conditions imposed on the explanatory variables. One can follow any path from the root down to one of the leaves, which shows the values of the response variable that result from the decisions made along the branches.
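To make the root-to-leaf walk concrete, here is a minimal sketch of a tiny hand-written tree. Only the first split (X[0] <= 38.05) is taken from the tree described above. The second-level threshold of 50.0 and all leaf classes are made up for illustration, and `toy_tree` and `predict` are hypothetical names.

```python
# Illustrative only: the 38.05 cut on X[0] comes from the post;
# the 50.0 cut on X[1] and the leaf classes are hypothetical.
toy_tree = {
    "feature": 0, "threshold": 38.05,        # X[0]: internet use rate
    "left": {
        "feature": 1, "threshold": 50.0,     # X[1]: female employment rate
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
    "right": {"leaf": 1},
}

def predict(node, x):
    """Walk from the root to a leaf, going left when x[feature] <= threshold."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# x = [internet use rate, female employment rate, urban rate]
print(predict(toy_tree, [20.0, 45.0, 60.0]))
print(predict(toy_tree, [20.0, 55.0, 60.0]))
print(predict(toy_tree, [70.0, 55.0, 60.0]))
```

Each call follows exactly one path, so a prediction costs at most as many comparisons as the tree is deep.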
Shape of the training and test samples
The output shows the shape of the training and test samples. 87 observations, about 60% of the dataset, are used as the training set. The remaining 59 observations, about 40%, are used as the test set. There are 3 predictors (explanatory variables) used in this example.
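The 87/59 split follows from the total of 146 complete observations and test_size=.4. The arithmetic can be sketched as below, assuming (as I understand scikit-learn's behaviour) that the test share is rounded up and the training set takes the rest.

```python
import math

n_samples = 146   # 87 + 59 observations left after dropping missing values
test_size = 0.4

# Assumption: the test share is rounded up, the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(round(n_train / n_samples, 3), round(n_test / n_samples, 3))
```

The exact shares are therefore slightly under 60% and slightly over 40%, because 40% of 146 is not a whole number.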
Correct and Incorrect classification of Decision Tree
The main diagonal of the confusion matrix, 11 and 35, gives the number of true negatives and true positives respectively. The off-diagonal entries, 5 and 8, are the misclassified countries (false negatives and false positives) for the variable polity score. A value of 1 for the variable means the country is politically biased, i.e. autocratic or democratic. 0 implies the country is neutral.
The accuracy of the model came out as 0.779661016949, i.e. (11 + 35) / 59, so about 78% of the countries in the test set were classified correctly.
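As a quick check, the accuracy can be recomputed by hand from the four confusion matrix counts reported above. The variable names here are only for illustration.

```python
# Counts taken from the confusion matrix reported in this post
true_negatives, true_positives = 11, 35
misclassified = 5 + 8   # false negatives plus false positives

total = true_negatives + true_positives + misclassified
accuracy = (true_negatives + true_positives) / total

print(total)      # size of the test set
print(accuracy)   # share of test countries classified correctly
```

The total of 59 matches the size of the test set, and the ratio agrees with the value printed by accuracy_score.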
Code
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 6 23:33:14 2016
@author: Abhishek
"""
from pandas import Series, DataFrame
import pandas
import numpy as np
import os
import matplotlib.pylab as plt
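# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split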
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
#os.chdir("C:\Data Analysis and Interpretation")
"""
Data Engineering and Analysis
"""
#Load the dataset
data = pandas.read_csv('gapminder.csv')
# Replace empty values with NaN
# (np.object was removed from recent NumPy versions; the builtin object is equivalent)
for col in ['polityscore', 'femaleemployrate', 'internetuserate', 'urbanrate']:
    data[col] = data[col].astype(object)
    data[col] = data[col].replace(' ', np.nan)
    data[col] = data[col].replace('', np.nan)
# Target Variable
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
# Predictor Variables
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
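# Drop rows with missing values; copy so that later column assignments
# do not trigger pandas' SettingWithCopyWarning
data_clean = data.dropna().copy()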
#print(data_clean.dtypes)
#print(data_clean.describe())
#%%
# Data Management: Polity Score is chosen as the response variable.
# <= -5: Autocratic, between -5 and 5: Anocratic, >= 5: Democratic
# Here, Anocratic countries are coded as 0 and countries with political
# biases are coded as 1. Hence we have our binary response variable
def RecodePolityScore (row):
    if row['polityscore']<=-5 or row['polityscore']>=5 :
        return 1
    elif row['polityscore']>-5 and row['polityscore']<5 :
        return 0
# Apply the recoding, then check that it is done
data_clean['polityscore'] = data_clean.apply(RecodePolityScore, axis=1)
chk1d = data_clean['polityscore'].value_counts(sort=False, dropna=True)
print(chk1d)
#%%
"""
Modeling and Prediction
"""
#Split into training and testing sets
predictors = data_clean[[
    'internetuserate',
    'femaleemployrate',
    'urbanrate'
]]
targets = data_clean.polityscore
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print("Shape")
print("-----")
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
print("Predictions")
print("-----------")
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print("Accuracy of the Model")
print("---------------------")
print(sklearn.metrics.accuracy_score(tar_test, predictions))
#Displaying the decision tree
from sklearn import tree
from io import StringIO
import pydot_ng as pydotplus
# Export the fitted tree to Graphviz dot format, then render it as a PNG
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph=pydotplus.graph_from_dot_data(out.getvalue())
with open('DecisionTree.png', 'wb') as f:
    f.write(graph.create_png())