Sunday, September 27, 2015

Running my first program

Code

# -*- coding: utf-8 -*-
"""
Frequency Distribution

"""
import pandas
import numpy

# low_memory=False makes pandas parse the whole file in one pass instead of
# in chunks, so each column gets a single, consistently inferred dtype
data = pandas.read_csv('gapminder.csv', low_memory=False)

# Convert all DataFrame column names to lower case
data.columns = data.columns.str.lower()

# Display floats in fixed-point notation instead of scientific notation
pandas.set_option('display.float_format', lambda x: '%f' % x)


# Get data for only the G20 countries (19 countries)
g20_countries = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
                 'France', 'Germany', 'India', 'Indonesia', 'Italy',
                 'Japan', 'Mexico', 'Russia', 'Saudi Arabia', 'South Africa',
                 'Korea, Rep.', 'Turkey', 'United Kingdom', 'United States']
dataG20Copy = data[data['country'].isin(g20_countries)]

# Not always necessary, but working on an explicit copy avoids the
# SettingWithCopyWarning that pandas can display on later assignments
dataG20 = dataG20Copy.copy()

print('------- femaleemployrate of G20 countries -------')
print('Count: ')
filter_values = [10, 25, 50, 75, 100]
# Coerce the column to numbers; entries that cannot be parsed become NaN
femaleemployrate = pandas.to_numeric(dataG20['femaleemployrate'], errors='coerce')
ranges = pandas.cut(femaleemployrate, bins=filter_values).value_counts(sort=True, dropna=True)
print(ranges)

print('Percentages: ')
p1 = pandas.cut(femaleemployrate, bins=filter_values).value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)

print('------- incomeperperson of G20 countries -------')
print('Count: ')
filter_values = [10000, 25000, 50000, 75000, 100000, 200000]
# Coerce the column to numbers; entries that cannot be parsed become NaN
incomeperperson = pandas.to_numeric(dataG20['incomeperperson'], errors='coerce')
ranges = pandas.cut(incomeperperson, bins=filter_values).value_counts(sort=True, dropna=True)
print(ranges)

print('Percentages: ')
p1 = pandas.cut(incomeperperson, bins=filter_values).value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)

print('------- polityscore of G20 countries -------')
print('Count: ')
# Coerce the column to numbers; entries that cannot be parsed become NaN
polityscore = pandas.to_numeric(dataG20['polityscore'], errors='coerce')
c1 = polityscore.value_counts(sort=True, dropna=True)
print(c1)

print('Percentages: ')
p1 = polityscore.value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)


Results

I have shown the frequency distributions of femaleemployrate, incomeperperson and polityscore for the G20 countries. The initial data from the gapminder CSV file was filtered down to just the G20 countries (19 countries). Please see the code above.
For femaleemployrate and incomeperperson, every individual value occurred exactly once, so I grouped these variables into ranges. It was easy to figure out the max and min of these fields, and I chose the ranges accordingly. For femaleemployrate the bin edges are 10, 25, 50, 75 and 100, which pandas.cut turns into the ranges (10, 25], (25, 50], (50, 75] and (75, 100]; each range excludes its lower edge and includes its upper edge. For incomeperperson the ranges are (10000, 25000], (25000, 50000], (50000, 75000], (75000, 100000] and (100000, 200000]. Values outside the outermost edges fall into no range and are left out of the counts. This gives a much more informative distribution than counting individual values.
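
To make the binning step concrete, here is a small sketch of the same idea on made-up numbers. The values in `rates` are invented for illustration and are not taken from the gapminder file; only the bin edges match the ones I used for femaleemployrate.

```python
import pandas as pd

# Hypothetical employment rates, invented for illustration (not gapminder data)
rates = pd.Series([12.5, 30.1, 47.9, 50.0, 62.3, 80.4])

# Every raw value is distinct, so each one has a frequency of 1
raw_counts = rates.value_counts()

# Same bin edges as in the post; each bin is right-closed, e.g. (25, 50]
bin_edges = [10, 25, 50, 75, 100]
binned = pd.cut(rates, bins=bin_edges)

# sort_index() lists the bins in ascending order rather than by count
bin_counts = binned.value_counts().sort_index()
bin_percent = binned.value_counts(normalize=True).sort_index() * 100

print(bin_counts)
print(bin_percent)
```

Note that 50.0 lands in the (25, 50] bin, not in (50, 75], because the upper edge of each bin is included.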
