Analysing New York's Popular Baby Names Dataset.
In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
pd.set_option("display.precision", 2)
In [2]:
# Answering some questions about this Popular Baby Names Dataset in New York.
# This dataset was extracted from https://www.data.gov, the site of the U.S.
# Government’s open data.
In [3]:
data = pd.read_csv('Popular_Baby_Names_New_York_City.csv')
data.head()
Out[3]:
In [4]:
# Taking a look at data dimensionality:
In [5]:
print(data.shape)
In [6]:
# From the output, we can see that the table contains 19418 rows and 6 columns.
In [7]:
# How many female and male rows (Gender feature) does this dataset contain?
In [8]:
data['Gender'].value_counts()
Out[8]:
In [9]:
# 9933 rows are labelled FEMALE and 9485 are labelled MALE. These are row counts, which add up to the 19418 rows seen above.
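Since value_counts() counts rows, it may be worth separating "rows per gender" from "babies per gender". The sketch below uses a small invented table rather than the real CSV, and it assumes the dataset has a numeric column called "Count" holding the number of babies per name; that column name is an assumption and is not shown in this notebook.

```python
import pandas as pd

# Invented rows for illustration only; "Count" is an assumed column name
# (number of babies given that name), not confirmed by the notebook above.
toy = pd.DataFrame({
    "Gender": ["FEMALE", "FEMALE", "MALE"],
    "Child's First Name": ["Mia", "Layla", "Aaron"],
    "Count": [30, 12, 25],
})

# Number of table rows per gender (what value_counts() reports).
rows_per_gender = toy["Gender"].value_counts()

# Number of babies per gender, obtained by summing the assumed "Count" column.
babies_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(babies_per_gender)
```

On the real data the two results can differ a lot, because one row may stand for many babies.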
In [10]:
# To explore the dataset further, we sort its rows by Year of Birth, most recent first:
In [11]:
data.sort_values(by='Year of Birth', ascending=False)
Out[11]:
In [12]:
# Which ethnicities are represented in this dataset?
In [13]:
data['Ethnicity'].value_counts()
Out[13]:
In [14]:
# As we can see, several different ethnic groups are represented in the dataset.
In [15]:
# What percentage of the rows are Hispanic, females and males combined (Ethnicity feature)?
In [16]:
(data['Ethnicity'] == 'HISPANIC').sum() / data.shape[0]
Out[16]:
In [17]:
# Round the computed share instead of pasting the previous output by hand.
np.around((data['Ethnicity'] == 'HISPANIC').mean(), decimals=2)
Out[17]:
In [18]:
# Hispanic females and males together make up around 29% of the rows.
In [19]:
# What percentage of the rows are White Non Hispanic, females and males combined (Ethnicity feature)?
In [20]:
(data['Ethnicity'] == 'WHITE NON HISPANIC').sum() / data.shape[0]
Out[20]:
In [21]:
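# Round the computed share instead of pasting the previous output by hand.
np.around((data['Ethnicity'] == 'WHITE NON HISPANIC').mean(), decimals=2)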
Out[21]:
In [22]:
# White Non Hispanic females and males together make up around 28% of the rows.
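The two percentage questions above follow the same pattern, so they can be answered for every ethnicity in one call. This is a minimal sketch on an invented four-row table, not the real CSV: the mean of a boolean mask gives the share of one label, and value_counts(normalize=True) gives the share of every label.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Ethnicity": [
        "HISPANIC",
        "WHITE NON HISPANIC",
        "HISPANIC",
        "BLACK NON HISPANIC",
    ]
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the same as mask.sum() / len(mask).
hispanic_share = (toy["Ethnicity"] == "HISPANIC").mean()

# Share of every ethnicity at once, expressed as a percentage.
all_shares = toy["Ethnicity"].value_counts(normalize=True).mul(100).round(2)

print(hispanic_share)
print(all_shares)
```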
In [23]:
# What is the average Year of Birth of females in this dataset?
In [24]:
data[data['Gender']=='FEMALE']['Year of Birth'].mean()
Out[24]:
In [25]:
# Approximately the year 2013.
In [26]:
# What is the average Year of Birth of males in this dataset?
In [27]:
data[data['Gender']=='MALE']['Year of Birth'].mean()
Out[27]:
In [28]:
# Also approximately the year 2013.
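The two averages above can also be computed together with a groupby on Gender. The sketch below uses invented rows, so its numbers are not the ones from the real dataset.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["FEMALE", "MALE", "FEMALE", "MALE"],
    "Year of Birth": [2011, 2012, 2014, 2014],
})

# One mean per gender, returned as a Series indexed by the Gender labels.
mean_year = toy.groupby("Gender")["Year of Birth"].mean()

print(mean_year)
```

Note that this is the mean over rows, so each name entry weighs the same regardless of how many babies it stands for.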
In [29]:
# What are some of the names among the BLACK NON HISPANIC ethnicity?
In [30]:
data[data['Ethnicity'] == 'BLACK NON HISPANIC']["Child's First Name"]
Out[30]:
In [31]:
# Looking at the results above, some of the names among the BLACK NON HISPANIC ethnicity appear to have a Middle Eastern or Jewish
# influence and origin, for example Fatou, Layla, Abdoul, Levi and Aaron.
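When scanning names for one ethnicity, the same name can show up on several rows (for example once per year, or in different letter case). A minimal sketch of listing each name once is shown below. The rows are invented, and the mixed-case duplicate is an assumption made for the example, not something verified in the real file.

```python
import pandas as pd

# Invented rows for illustration only; the "Layla"/"LAYLA" duplicate is
# an assumed situation, not something checked against the real CSV.
toy = pd.DataFrame({
    "Ethnicity": [
        "BLACK NON HISPANIC",
        "BLACK NON HISPANIC",
        "HISPANIC",
        "BLACK NON HISPANIC",
    ],
    "Child's First Name": ["Layla", "LAYLA", "Mia", "Aaron"],
})

mask = toy["Ethnicity"] == "BLACK NON HISPANIC"

# Normalise the letter case, then keep each name once and sort alphabetically.
unique_names = sorted(toy.loc[mask, "Child's First Name"].str.title().unique())

print(unique_names)
```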
In [40]:
# Next we take a closer look at the popular names among the ASIAN AND PACIFIC ISLANDER babies.
# For this purpose we extract the part of the dataset that belongs to this ethnicity.
gb = data.groupby("Ethnicity")
asian_pac_islander = gb.get_group("ASIAN AND PACIFIC ISLANDER")
asian_pac_islander.iloc[546:576]
Out[40]:
In [42]:
# This gives some insight into the influence of names common in English-speaking countries, such as Mia, Melody, Phoebe and Megan.
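Instead of looking at an arbitrary slice of rows, the most popular names within one ethnicity could be ranked directly. The sketch below uses invented rows and again assumes a numeric "Count" column (babies per name), which is not shown in this notebook.

```python
import pandas as pd

# Invented rows for illustration only; "Count" is an assumed column name.
toy = pd.DataFrame({
    "Ethnicity": [
        "ASIAN AND PACIFIC ISLANDER",
        "ASIAN AND PACIFIC ISLANDER",
        "ASIAN AND PACIFIC ISLANDER",
        "HISPANIC",
    ],
    "Child's First Name": ["Mia", "Megan", "Phoebe", "Mia"],
    "Count": [40, 15, 22, 60],
})

subset = toy[toy["Ethnicity"] == "ASIAN AND PACIFIC ISLANDER"]

# The two rows with the largest assumed "Count" values in this group.
top_names = subset.nlargest(2, "Count")[["Child's First Name", "Count"]]

print(top_names)
```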