Code
# -*- coding: utf-8 -*-
"""
Frequency Distribution
"""
import pandas

# low_memory=False reads the whole file in one pass rather than in chunks,
# so each column is given a single, consistent dtype
data = pandas.read_csv('gapminder.csv', low_memory=False)

# Convert all DataFrame column names to lower case
data.columns = data.columns.str.lower()

# Display floats in fixed-point notation
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Get data for only the G20 countries (19 countries)
g20_countries = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
                 'France', 'Germany', 'India', 'Indonesia', 'Italy',
                 'Japan', 'Mexico', 'Russia', 'Saudi Arabia', 'South Africa',
                 'Korea, Rep.', 'Turkey', 'United Kingdom', 'United States']

# Not always necessary, but .copy() can eliminate the SettingWithCopyWarning
dataG20 = data[data['country'].isin(g20_countries)].copy()

# convert_objects() has been removed from pandas; to_numeric with
# errors='coerce' turns entries that cannot be parsed into NaN
femaleemployrate = pandas.to_numeric(dataG20['femaleemployrate'], errors='coerce')
incomeperperson = pandas.to_numeric(dataG20['incomeperperson'], errors='coerce')
polityscore = pandas.to_numeric(dataG20['polityscore'], errors='coerce')

print('------- femaleemployrate of G20 countries -------')
print('Count: ')
filter_values = [10, 25, 50, 75, 100]
ranges = pandas.cut(femaleemployrate, bins=filter_values).value_counts(sort=True, dropna=True)
print(ranges)
print('Percentages: ')
p1 = pandas.cut(femaleemployrate, bins=filter_values).value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)

print('------- incomeperperson of G20 countries -------')
print('Count: ')
filter_values = [10000, 25000, 50000, 75000, 100000, 200000]
ranges = pandas.cut(incomeperperson, bins=filter_values).value_counts(sort=True, dropna=True)
print(ranges)
print('Percentages: ')
p1 = pandas.cut(incomeperperson, bins=filter_values).value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)

print('------- polityscore of G20 countries -------')
print('Count: ')
c1 = polityscore.value_counts(sort=True, dropna=True)
print(c1)
print('Percentages: ')
p1 = polityscore.value_counts(sort=True, dropna=True, normalize=True) * 100
print(p1)
Results
I have shown the frequency distributions of femaleemployrate, incomeperperson and polityscore for the G20 countries. The initial data from the gapminder CSV file was filtered to just the G20 countries (19 countries). Please see the code above. For femaleemployrate and incomeperperson, each individual value occurred only once, so a plain frequency table was not informative and I grouped these variables into ranges instead. It was easy to find the minimum and maximum of these fields, and the bin edges were chosen accordingly. For femaleemployrate the bin edges are 10, 25, 50, 75 and 100, which pandas.cut turns into the right-inclusive intervals (10, 25], (25, 50], (50, 75] and (75, 100]. For incomeperperson the edges are 10000, 25000, 50000, 75000, 100000 and 200000, giving the intervals (10000, 25000], (25000, 50000], (50000, 75000], (75000, 100000] and (100000, 200000]. We get a much more readable distribution this way.
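To make the binning step concrete, here is a minimal sketch of how pandas.cut assigns values to ranges and how value_counts then summarizes them. The values in `rates` are made up for illustration and are not taken from gapminder.csv; only the bin edges match the ones used for femaleemployrate.

```python
import pandas

# Hypothetical employment rates in percent (illustrative, not from gapminder.csv)
rates = pandas.Series([12.0, 25.0, 25.5, 49.9, 50.0, 74.0, 80.0])

# Same bin edges as used for femaleemployrate in the post
bin_edges = [10, 25, 50, 75, 100]

# By default each interval is open on the left and closed on the right,
# so a value of exactly 25.0 falls into the first bin, not the second
binned = pandas.cut(rates, bins=bin_edges)

# Frequency of each range, and the same frequencies as percentages
counts = binned.value_counts(sort=True, dropna=True)
percentages = binned.value_counts(sort=True, dropna=True, normalize=True) * 100

print(counts)
print(percentages)
```

Note that a value equal to the lowest edge (10 here) would not fall into any bin unless `include_lowest=True` is passed to pandas.cut, so the lowest edge should sit below the observed minimum.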