Analysing the most popular film locations in San Francisco. Part 2

cleaning_organizing

Continuing with our project, we will now clean up some messy aspects of our film_locations dataset.

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('./Film_Locations_in_San_Francisco.csv', nrows=20)
df
Out[2]:
Title Release Year Locations Fun Facts Production Company Distributor Director Writer Actor 1 Actor 2 Actor 3
0 180 2011 Epic Roasthouse (399 Embarcadero) NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
1 180 2011 Mason & California Streets (Nob Hill) NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
2 180 2011 Justin Herman Plaza NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
3 180 2011 200 block Market Street NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
4 180 2011 City Hall NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
5 180 2011 Polk & Larkin Streets NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
6 180 2011 Randall Museum NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
7 180 2011 555 Market St. NaN SPI Cinemas NaN Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
8 24 Hours on Craigslist 2005 NaN NaN Yerba Buena Productions Zealot Pictures Michael Ferris Gibson NaN Craig Newmark NaN NaN
9 A Night Full of Rain 1978 Embarcadero Freeway Embarcadero Freeway, which was featured in the... Liberty Film Warner Bros. Pictures Lina Wertmuller Lina Wertmuller Candice Bergen Giancarlo Gianni NaN
10 A Night Full of Rain 1978 Fairmont Hotel (950 Mason Street, Nob Hill) In 1945 the Fairmont hosted the United Nations... Liberty Film Warner Bros. Pictures Lina Wertmuller Lina Wertmuller Candice Bergen Giancarlo Gianni NaN
11 A Night Full of Rain 1978 San Francisco Chronicle (901 Mission Street at... The San Francisco Zodiac Killer of the late 19... Liberty Film Warner Bros. Pictures Lina Wertmuller Lina Wertmuller Candice Bergen Giancarlo Gianni NaN
12 A Night Full of Rain 1978 Broadway (North Beach) NaN Liberty Film Warner Bros. Pictures Lina Wertmuller Lina Wertmuller Candice Bergen Giancarlo Gianni NaN
13 About a Boy 2014 Broderick from Fulton to McAlister NaN NBC Studios National Broadcasting Company Mark J. Kunerth Jason Katims David Walton Minnie Driver NaN
14 About a Boy 2014 Crissy Field NaN NBC Studios National Broadcasting Company Mark J. Kunerth Jason Katims David Walton Minnie Driver NaN
15 About a Boy 2014 Powell from Bush and Sutter NaN NBC Studios National Broadcasting Company Mark J. Kunerth Jason Katims David Walton Minnie Driver NaN
16 Age of Adaline 2015 Pier 50- end of the pier NaN Lionsgate / Sidney Kimmel Entertainment / Lake... NaN Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
17 Age of Adaline 2015 California @ Montgomery NaN Lionsgate / Sidney Kimmel Entertainment / Lake... NaN Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
18 Age of Adaline 2015 Montgomery/Green NaN Lionsgate / Sidney Kimmel Entertainment / Lake... NaN Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
19 Age of Adaline 2015 Driving various SF Streets NaN Lionsgate / Sidney Kimmel Entertainment / Lake... NaN Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn

Before continuing, we need to fix our first problem: rows that contain NaN (missing values). First, we will check how many columns contain missing values:

In [3]:
print("Number of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("Number of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total number of columns in the dataframe")
print(len(df.columns))
Number of columns containing null values
6
Number of columns not containing null values
5
Total number of columns in the dataframe
11

Our dataframe contains 11 columns, 6 of which contain at least one null value.
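Knowing how many columns have nulls is a start, but it helps to see which ones and how badly. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the variable names are assumptions for illustration, not part of the original notebook); the same two lines could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the film locations data.
toy = pd.DataFrame({
    "Title": ["180", "180", "A Night Full of Rain"],
    "Locations": ["City Hall", np.nan, "Broadway (North Beach)"],
    "Fun Facts": [np.nan, np.nan, np.nan],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Keep only the columns that have at least one null, worst first.
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```

Sorting the counts makes it easier to decide which columns are too sparse to keep.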

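We will first drop the columns with many null values, then drop the rows that still contain a null. Here a column is dropped when its null count exceeds the number of columns in the dataframe (11), which removes Fun Facts and Distributor: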

In [4]:
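# Drop columns whose null count is greater than the number of columns
df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
# Drop every remaining row that still has a null, then renumber the index
df = df.dropna(axis=0).reset_index(drop=True)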
In [5]:
df
Out[5]:
Title Release Year Locations Production Company Director Writer Actor 1 Actor 2 Actor 3
0 180 2011 Epic Roasthouse (399 Embarcadero) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
1 180 2011 Mason & California Streets (Nob Hill) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
2 180 2011 Justin Herman Plaza SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
3 180 2011 200 block Market Street SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
4 180 2011 City Hall SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
5 180 2011 Polk & Larkin Streets SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
6 180 2011 Randall Museum SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
7 180 2011 555 Market St. SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Siddarth Nithya Menon Priya Anand
8 Age of Adaline 2015 Pier 50- end of the pier Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
9 Age of Adaline 2015 California @ Montgomery Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
10 Age of Adaline 2015 Montgomery/Green Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
11 Age of Adaline 2015 Driving various SF Streets Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Blake Lively Harrison Ford Ellen Burstyn
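The rule used above ties the column cutoff to the number of columns, which happens to work on this 20-row sample but would behave differently on the full file. One possible alternative, sketched here under assumptions, is to express the cutoff as a fraction of rows. The toy data and the 0.5 value of `max_null_fraction` are hypothetical choices for illustration, not the original notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the real dataset.
toy = pd.DataFrame({
    "Title": ["180", "180", "About a Boy", "Age of Adaline"],
    "Fun Facts": [np.nan, np.nan, np.nan, "A fact"],
    "Actor 3": ["Priya Anand", "Priya Anand", np.nan, "Ellen Burstyn"],
})

# Assumed cutoff: drop a column when more than half of its values are null.
max_null_fraction = 0.5

# isna().mean() gives the fraction of nulls per column.
null_fraction = toy.isna().mean()
sparse_columns = null_fraction[null_fraction > max_null_fraction].index

# Drop the sparse columns, then drop rows that still contain a null.
cleaned = toy.drop(columns=sparse_columns).dropna(axis=0).reset_index(drop=True)
print(cleaned)
```

A fraction-based cutoff keeps its meaning regardless of how many rows are loaded.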

The next problem with the dataframe is that the columns Actor 1, Actor 2, and Actor 3 are redundant and can be reduced to a single variable. We will melt the data, keeping the remaining columns intact, and name the new variable Actors and its value Actor_Name.

In [6]:
#Melting the data:
df_long = pd.melt(df, id_vars= ['Title', 'Release Year', 'Locations',  
                                'Production Company', 'Director', 'Writer'],
                       var_name = 'Actors',
                       value_name = 'Actor_Name')
In [7]:
df_long
Out[7]:
Title Release Year Locations Production Company Director Writer Actors Actor_Name
0 180 2011 Epic Roasthouse (399 Embarcadero) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
1 180 2011 Mason & California Streets (Nob Hill) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
2 180 2011 Justin Herman Plaza SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
3 180 2011 200 block Market Street SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
4 180 2011 City Hall SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
5 180 2011 Polk & Larkin Streets SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
6 180 2011 Randall Museum SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
7 180 2011 555 Market St. SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 1 Siddarth
8 Age of Adaline 2015 Pier 50- end of the pier Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 1 Blake Lively
9 Age of Adaline 2015 California @ Montgomery Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 1 Blake Lively
10 Age of Adaline 2015 Montgomery/Green Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 1 Blake Lively
11 Age of Adaline 2015 Driving various SF Streets Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 1 Blake Lively
12 180 2011 Epic Roasthouse (399 Embarcadero) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
13 180 2011 Mason & California Streets (Nob Hill) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
14 180 2011 Justin Herman Plaza SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
15 180 2011 200 block Market Street SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
16 180 2011 City Hall SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
17 180 2011 Polk & Larkin Streets SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
18 180 2011 Randall Museum SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
19 180 2011 555 Market St. SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 2 Nithya Menon
20 Age of Adaline 2015 Pier 50- end of the pier Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 2 Harrison Ford
21 Age of Adaline 2015 California @ Montgomery Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 2 Harrison Ford
22 Age of Adaline 2015 Montgomery/Green Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 2 Harrison Ford
23 Age of Adaline 2015 Driving various SF Streets Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 2 Harrison Ford
24 180 2011 Epic Roasthouse (399 Embarcadero) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
25 180 2011 Mason & California Streets (Nob Hill) SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
26 180 2011 Justin Herman Plaza SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
27 180 2011 200 block Market Street SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
28 180 2011 City Hall SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
29 180 2011 Polk & Larkin Streets SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
30 180 2011 Randall Museum SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
31 180 2011 555 Market St. SPI Cinemas Jayendra Umarji Anuradha, Jayendra, Aarthi Sriram, & Suba Actor 3 Priya Anand
32 Age of Adaline 2015 Pier 50- end of the pier Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 3 Ellen Burstyn
33 Age of Adaline 2015 California @ Montgomery Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 3 Ellen Burstyn
34 Age of Adaline 2015 Montgomery/Green Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 3 Ellen Burstyn
35 Age of Adaline 2015 Driving various SF Streets Lionsgate / Sidney Kimmel Entertainment / Lake... Lee Toland Krieger J. Mills Goodloe Actor 3 Ellen Burstyn

Melting is very useful for modeling purposes. For example, before transferring data to a database, we can use this technique to organize it into a tidier, long format.
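One consequence of the long format is that a missing actor only affects its own row. As a sketch of a possible alternative order of operations (an assumption, not what this notebook does), melting first and then dropping nulls only in Actor_Name would keep films that list fewer than three actors. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second film lists only two actors.
toy = pd.DataFrame({
    "Title": ["180", "A Night Full of Rain"],
    "Actor 1": ["Siddarth", "Candice Bergen"],
    "Actor 2": ["Nithya Menon", "Giancarlo Gianni"],
    "Actor 3": ["Priya Anand", np.nan],
})

# Melt the three actor columns into one, naming them explicitly.
toy_long = pd.melt(
    toy,
    id_vars=["Title"],
    value_vars=["Actor 1", "Actor 2", "Actor 3"],
    var_name="Actors",
    value_name="Actor_Name",
)

# Drop only the rows where the actor name itself is missing.
toy_long = toy_long.dropna(subset=["Actor_Name"]).reset_index(drop=True)
print(toy_long)
```

Passing `value_vars` explicitly also documents which columns are being folded, instead of relying on "everything not in `id_vars`".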

In [8]:
df.shape
Out[8]:
(12, 9)
In [9]:
df_long.shape
Out[9]:
(36, 8)

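Melting does not reduce memory usage, though. The long dataframe has three times as many rows, and info() reports about 2.4 KB for df_long against 992 bytes for df: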

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               12 non-null     object
 1   Release Year        12 non-null     int64 
 2   Locations           12 non-null     object
 3   Production Company  12 non-null     object
 4   Director            12 non-null     object
 5   Writer              12 non-null     object
 6   Actor 1             12 non-null     object
 7   Actor 2             12 non-null     object
 8   Actor 3             12 non-null     object
dtypes: int64(1), object(8)
memory usage: 992.0+ bytes
In [11]:
df_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               36 non-null     object
 1   Release Year        36 non-null     int64 
 2   Locations           36 non-null     object
 3   Production Company  36 non-null     object
 4   Director            36 non-null     object
 5   Writer              36 non-null     object
 6   Actors              36 non-null     object
 7   Actor_Name          36 non-null     object
dtypes: int64(1), object(7)
memory usage: 2.4+ KB
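The "+" in the info() figures means the strings stored in object columns are not fully counted. A minimal sketch of a more accurate comparison uses `memory_usage(deep=True)`; the toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame in wide format.
toy = pd.DataFrame({
    "Title": ["180", "180", "Age of Adaline", "Age of Adaline"],
    "Actor 1": ["Siddarth", "Siddarth", "Blake Lively", "Blake Lively"],
    "Actor 2": ["Nithya Menon", "Nithya Menon", "Harrison Ford", "Harrison Ford"],
    "Actor 3": ["Priya Anand", "Priya Anand", "Ellen Burstyn", "Ellen Burstyn"],
})

# Same melt as in the notebook, reduced to the toy columns.
toy_long = pd.melt(toy, id_vars=["Title"],
                   var_name="Actors", value_name="Actor_Name")

# deep=True also counts the string contents of each column.
wide_bytes = toy.memory_usage(deep=True).sum()
long_bytes = toy_long.memory_usage(deep=True).sum()
print(f"wide: {wide_bytes} bytes, long: {long_bytes} bytes")
```

The long frame repeats the identifier columns once per actor, so it is expected to be larger than the wide one.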

Finally, we will save our dataframe to a CSV file:

In [12]:
df_long.to_csv('df_long.csv', index=False)
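A quick way to check the export is to read it back and compare. The sketch below writes to an in-memory buffer instead of a file so that it is self-contained; the toy frame and the names `buffer` and `restored` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical toy frame in long format.
toy_long = pd.DataFrame({
    "Title": ["180", "Age of Adaline"],
    "Release Year": [2011, 2015],
    "Actors": ["Actor 1", "Actor 1"],
    "Actor_Name": ["Siddarth", "Blake Lively"],
})

# Write without the index, as in the notebook, then read the text back.
buffer = io.StringIO()
toy_long.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

With `index=False` the row index is not written as an extra column, so the restored frame has the same columns as the original. If a column held only numeric-looking titles such as "180", `read_csv` would infer integers; passing `dtype={"Title": str}` keeps them as text.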
