Taxis - Encoding
1. Dataset analysis
import seaborn as sns
# Load the taxis sample dataset and drop rows with missing values
df = sns.load_dataset('taxis')
df = df.dropna()
df.head()
                pickup              dropoff  ...  pickup_borough  dropoff_borough
0  2019-03-23 20:21:09  2019-03-23 20:27:24  ...       Manhattan        Manhattan
1  2019-03-04 16:11:55  2019-03-04 16:19:00  ...       Manhattan        Manhattan
2  2019-03-27 17:53:01  2019-03-27 18:00:25  ...       Manhattan        Manhattan
3  2019-03-10 01:23:59  2019-03-10 01:49:51  ...       Manhattan        Manhattan
4  2019-03-30 13:27:42  2019-03-30 13:37:14  ...       Manhattan        Manhattan

[5 rows x 14 columns]
Six columns hold categorical (text) values: 'color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough'
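The list above can also be derived from the column dtypes instead of being read off by eye. The following is a minimal sketch on a small invented frame (`toy` and its values are illustrative assumptions, not rows of the taxis data); `select_dtypes` is the standard pandas call.

```python
import pandas as pd

# Toy frame standing in for the taxis data (values are illustrative only)
toy = pd.DataFrame({
    'pickup': pd.to_datetime(['2019-03-23 20:21:09', '2019-03-04 16:11:55']),
    'fare': [7.0, 5.0],
    'color': ['yellow', 'green'],
    'payment': ['credit card', 'cash'],
})

# Exclude numeric and datetime columns; what remains are the text columns
categorical_cols = toy.select_dtypes(exclude=['number', 'datetime']).columns.tolist()
print(categorical_cols)
```

Applied to `df`, the same call should return the six columns listed above, provided the pickup and dropoff timestamps were parsed as datetimes.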
Value counts for the 'color' column
df['color'].value_counts()
color
yellow    5373
green      968
Name: count, dtype: int64
Unique values in the 'color' column
df['color'].unique()
['yellow' 'green']
Unique values in the 'payment' column
df['payment'].unique()
['credit card' 'cash']
Number of distinct values in the 'pickup_zone' column
df['pickup_zone'].nunique()
194
Unique values in the 'pickup_zone' column
df['pickup_zone'].unique()
['Lenox Hill West' 'Upper West Side South' 'Alphabet City' 'Hudson Sq' 'Midtown East' 'Times Sq/Theatre District' 'Battery Park City' 'East Harlem South' 'Lincoln Square East' 'LaGuardia Airport' 'Murray Hill' 'Lincoln Square West' 'Financial District North' 'Upper West Side North' 'East Chelsea' 'Midtown Center' 'Gramercy' 'Penn Station/Madison Sq West' 'Sutton Place/Turtle Bay North' 'West Chelsea/Hudson Yards' 'Clinton East' 'Clinton West' 'UN/Turtle Bay South' 'Midtown South' 'Midtown North' 'Garment District' 'Lenox Hill East' 'Flatiron' 'TriBeCa/Civic Center' 'Upper East Side North' 'West Village' 'Greenwich Village South' 'JFK Airport' 'East Village' 'Union Sq' 'Yorkville West' 'Central Park' 'Meatpacking/West Village West' 'Kips Bay' 'Morningside Heights' 'Astoria' 'East Tremont' 'Upper East Side South' 'Financial District South' 'Bloomingdale' 'Queensboro Hill' 'SoHo' 'Brooklyn Heights' 'Yorkville East' 'Manhattan Valley' 'DUMBO/Vinegar Hill' 'Little Italy/NoLiTa' 'Mott Haven/Port Morris' 'Greenwich Village North' 'Stuyvesant Heights' 'Lower East Side' 'East Harlem North' 'Chinatown' 'Fort Greene' 'Steinway' 'Central Harlem' 'Crown Heights North' 'Seaport' 'Two Bridges/Seward Park' 'Williamsburg (South Side)' 'Rosedale' 'Flushing' 'Old Astoria' 'Soundview/Castle Hill' 'Stuy Town/Peter Cooper Village' 'World Trade Center' 'Sunnyside' 'Washington Heights South' 'Prospect Heights' 'East New York' 'Hamilton Heights' 'Cobble Hill' 'Long Island City/Queens Plaza' 'Central Harlem North' 'Manhattanville' 'East Flatbush/Farragut' 'Elmhurst' 'East Concourse/Concourse Village' 'Boerum Hill' 'Park Slope' 'Greenpoint' 'Williamsburg (North Side)' 'Long Island City/Hunters Point' 'South Ozone Park' 'Ridgewood' 'Downtown Brooklyn/MetroTech' 'Queensbridge/Ravenswood' 'Williamsbridge/Olinville' 'Bedford' 'Gowanus' 'Jackson Heights' 'South Jamaica' 'Bushwick North' 'West Concourse' 'Queens Village' 'Windsor Terrace' 'Flatlands' 'Van Cortlandt Village' 'Woodside' 'East 
Williamsburg' 'Fordham South' 'East Elmhurst' 'Flushing Meadows-Corona Park' 'Marine Park/Mill Basin' 'Carroll Gardens' 'Canarsie' 'East Flatbush/Remsen Village' 'Jamaica' 'Marble Hill' 'Bushwick South' 'Erasmus' 'Claremont/Bathgate' 'Pelham Bay' 'Soundview/Bruckner' 'South Williamsburg' 'Battery Park' 'Forest Hills' 'Kew Gardens' 'Maspeth' 'Bronx Park' 'Starrett City' 'Brighton Beach' 'Brownsville' 'Highbridge Park' 'Bensonhurst East' 'Mount Hope' 'Prospect-Lefferts Gardens' 'Bayside' 'Douglaston' 'Midwood' 'North Corona' 'Homecrest' 'Westchester Village/Unionport' 'University Heights/Morris Heights' 'Inwood' 'Washington Heights North' 'Flatbush/Ditmas Park' 'Rego Park' 'Riverdale/North Riverdale/Fieldston' 'Jamaica Estates' 'Borough Park' 'Sunset Park West' 'Belmont' 'Auburndale' 'Schuylerville/Edgewater Park' 'Co-Op City' 'Crown Heights South' 'Spuyten Duyvil/Kingsbridge' 'Morrisania/Melrose' 'Hollis' 'Parkchester' 'Coney Island' 'Richmond Hill' 'Bedford Park' 'Highbridge' 'Clinton Hill' 'Sheepshead Bay' 'Madison' 'Dyker Heights' 'Cambria Heights' 'Pelham Parkway' 'Hunts Point' 'Melrose South' 'Springfield Gardens North' 'Bay Ridge' 'Elmhurst/Maspeth' 'Crotona Park East' 'Bronxdale' 'Briarwood/Jamaica Hills' 'Van Nest/Morris Park' 'Murray Hill-Queens' 'Kingsbridge Heights' 'Whitestone' 'Saint Albans' 'Allerton/Pelham Gardens' 'Howard Beach' 'Norwood' 'Bensonhurst West' 'Columbia Street' 'Middle Village' 'East Flushing' 'Prospect Park' 'Ozone Park' 'Gravesend' 'Glendale' 'Kew Gardens Hills' 'Woodlawn/Wakefield' 'West Farms/Bronx River' 'Hillcrest/Pomonok']
Unique values in the 'dropoff_zone' column
df['dropoff_zone'].unique()
['UN/Turtle Bay South' 'Upper West Side South' 'West Village' 'Yorkville West' 'Midtown East' 'Two Bridges/Seward Park' 'Midtown Center' 'Central Park' 'Astoria' 'Manhattan Valley' 'Times Sq/Theatre District' 'Clinton East' 'Meatpacking/West Village West' 'East Harlem South' 'East Chelsea' 'Kips Bay' 'Murray Hill' 'Sutton Place/Turtle Bay North' 'Midtown North' 'Gramercy' 'Midtown South' 'Seaport' 'Lenox Hill West' 'East Harlem North' 'Garment District' 'West Chelsea/Hudson Yards' 'Clinton West' 'Lenox Hill East' 'Flatiron' 'Carroll Gardens' 'Washington Heights South' 'Battery Park City' 'Penn Station/Madison Sq West' 'Union Sq' 'Sunnyside' 'Lincoln Square West' 'Upper East Side North' 'Financial District North' 'Lower East Side' 'Yorkville East' 'Upper West Side North' 'Jackson Heights' 'Upper East Side South' 'Chinatown' 'Stuy Town/Peter Cooper Village' 'Morningside Heights' 'Lincoln Square East' 'Little Italy/NoLiTa' 'Downtown Brooklyn/MetroTech' 'DUMBO/Vinegar Hill' 'Greenwich Village South' 'LaGuardia Airport' 'East Village' 'JFK Airport' 'Marble Hill' 'Greenwich Village North' 'Williamsburg (North Side)' 'Brooklyn Heights' 'Riverdale/North Riverdale/Fieldston' 'Steinway' 'Sheepshead Bay' 'Crown Heights North' 'TriBeCa/Civic Center' 'Midwood' 'Alphabet City' 'Boerum Hill' 'Financial District South' 'Cypress Hills' 'Park Slope' 'Central Harlem' 'North Corona' 'Greenpoint' 'Long Island City/Hunters Point' 'Hillcrest/Pomonok' 'Bloomingdale' 'Baisley Park' 'Crown Heights South' 'Soundview/Castle Hill' 'World Trade Center' 'Randalls Island' 'Melrose South' 'Williamsburg (South Side)' 'SoHo' 'Hudson Sq' 'Fort Greene' 'Cobble Hill' 'Clinton Hill' 'Central Harlem North' 'East Flushing' 'Old Astoria' 'Briarwood/Jamaica Hills' 'East New York' 'Ridgewood' 'Elmhurst' 'East Williamsburg' 'Williamsbridge/Olinville' 'University Heights/Morris Heights' 'Bushwick South' 'Forest Hills' 'Flushing Meadows-Corona Park' 'Long Island City/Queens Plaza' 'Columbia Street' 
'Manhattanville' 'Elmhurst/Maspeth' 'Inwood' 'Woodhaven' 'Hamilton Heights' 'Middle Village' 'Prospect Heights' 'Richmond Hill' 'Mount Hope' 'Bushwick North' 'Canarsie' 'Gowanus' 'Washington Heights North' 'Westchester Village/Unionport' 'Queens Village' 'Woodside' 'Bedford' 'Highbridge' 'Stuyvesant Heights' 'Queensbridge/Ravenswood' 'East Flatbush/Farragut' 'Mott Haven/Port Morris' 'Prospect-Lefferts Gardens' 'Sunset Park West' 'South Jamaica' 'Howard Beach' 'South Williamsburg' 'Woodlawn/Wakefield' 'Rego Park' 'West Concourse' 'Manhattan Beach' 'Battery Park' 'Bronxdale' 'West Brighton' 'Flatlands' 'Glendale' 'East Concourse/Concourse Village' 'Ozone Park' 'South Ozone Park' 'Norwood' 'Parkchester' 'East Tremont' 'Douglaston' 'Windsor Terrace' 'Bensonhurst West' 'Kew Gardens' 'Flatbush/Ditmas Park' 'Starrett City' 'Roosevelt Island' 'Bay Ridge' 'Saint Albans' 'Pelham Parkway' 'Prospect Park' 'Jamaica' 'Murray Hill-Queens' 'Stapleton' 'Maspeth' 'Dyker Heights' 'Allerton/Pelham Gardens' 'Co-Op City' 'Belmont' 'Bensonhurst East' 'Kew Gardens Hills' 'Crotona Park East' 'Van Cortlandt Village' 'Springfield Gardens South' 'Corona' 'Brownsville' 'Red Hook' 'Bayside' 'Van Nest/Morris Park' 'Gravesend' 'Oakland Gardens' 'Claremont/Bathgate' 'Ocean Hill' 'Brighton Beach' 'Spuyten Duyvil/Kingsbridge' 'Kingsbridge Heights' 'Soundview/Bruckner' 'Fresh Meadows' 'East Elmhurst' 'Hunts Point' 'Cambria Heights' 'Whitestone' 'East Flatbush/Remsen Village' 'Rosedale' 'Inwood Hill Park' 'Bedford Park' 'Jamaica Estates' 'Borough Park' 'Flushing' 'Auburndale' 'Bath Beach' 'Queensboro Hill' 'Morrisania/Melrose' 'Madison' 'Homecrest' 'Eastchester' 'College Point' 'Brooklyn Navy Yard' 'Marine Park/Mill Basin']
Unique values in the 'pickup_borough' column
df['pickup_borough'].unique()
['Manhattan' 'Queens' 'Bronx' 'Brooklyn']
Unique values in the 'dropoff_borough' column
df['dropoff_borough'].unique()
['Manhattan' 'Queens' 'Brooklyn' 'Bronx' 'Staten Island']
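The column-by-column inspection above can be condensed into a single cardinality summary. This is a sketch on an invented frame (`toy` is an assumption for illustration); the point is that one-hot encoding creates one output column per category, so the summed cardinalities give the width of the encoded matrix.

```python
import pandas as pd

# Toy frame standing in for the categorical columns (values are illustrative only)
toy = pd.DataFrame({
    'color': ['yellow', 'green', 'yellow', 'yellow'],
    'pickup_zone': ['Astoria', 'SoHo', 'Gramercy', 'SoHo'],
    'pickup_borough': ['Queens', 'Manhattan', 'Manhattan', 'Manhattan'],
})

# Number of distinct categories per column, largest first
cardinality = toy.nunique().sort_values(ascending=False)

# One-hot encoding yields one column per category, so the width is the sum
n_onehot_columns = int(cardinality.sum())
print(cardinality)
print(n_onehot_columns)
```

On the taxis data, the same sum over the six categorical columns should match the 410 columns reported by the encoder output in the encoding section.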
All of these variables appear to be nominal, which suggests one-hot encoding.
However, the very large number of categories in pickup_zone and dropoff_zone is likely to produce a sparse matrix, with the associated risk of overfitting and high RAM consumption.
In practice, the usual alternatives are:
- Target encoding,
- GPS coordinate encoding,
- Feature engineering to group the zones into clusters
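To make the first option in the list above concrete, here is a minimal sketch of smoothed target encoding. The notebook does not say which target would be predicted; using `fare` as the target, the helper name `smoothed_target_encoding`, and the `smoothing` parameter are all assumptions made for illustration.

```python
import pandas as pd

def smoothed_target_encoding(frame, column, target, smoothing=10.0):
    """Map each category of `column` to a smoothed mean of `target`."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(['mean', 'count'])
    # Rare categories are shrunk toward the global mean
    return (stats['count'] * stats['mean'] + smoothing * global_mean) / (stats['count'] + smoothing)

# Toy data (illustrative only); 'fare' as target is a hypothetical choice
toy = pd.DataFrame({
    'pickup_zone': ['A', 'A', 'A', 'B'],
    'fare': [10.0, 12.0, 14.0, 40.0],
})

encoding = smoothed_target_encoding(toy, 'pickup_zone', 'fare', smoothing=1.0)
toy['pickup_zone_te'] = toy['pickup_zone'].map(encoding)
print(toy)
```

Each zone is replaced by a single numeric column, whatever the number of zones. To avoid target leakage, the mapping would normally be computed on the training split only and then applied to the validation and test splits.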
Is there a correlation between pickup_zone and pickup_borough?
2. Encoding
OneHot Encoder - categorical columns
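from sklearn.preprocessing import OneHotEncoder
# Dense output (sparse_output=False) so the result can be displayed and plotted directly
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']])
encoder.transform(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']])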
      color_green  ...  dropoff_borough_Staten Island
0             0.0  ...                            0.0
1             0.0  ...                            0.0
2             0.0  ...                            0.0
3             0.0  ...                            0.0
4             0.0  ...                            0.0
...           ...  ...                            ...
6428          1.0  ...                            0.0
6429          1.0  ...                            0.0
6430          1.0  ...                            0.0
6431          1.0  ...                            0.0
6432          1.0  ...                            0.0

[6341 rows x 410 columns]
Sparse matrix: a matrix in which the vast majority of values are zero
sns.heatmap(encoder.transform(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']]))
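Besides the heatmap, the sparsity can be quantified as the fraction of non-zero cells. The sketch below uses an invented frame and `pandas.get_dummies` as a lightweight stand-in for the `OneHotEncoder` used above (both the toy data and that substitution are assumptions for illustration).

```python
import pandas as pd

# Toy frame standing in for the categorical columns (values are illustrative only)
toy = pd.DataFrame({
    'color': ['yellow', 'green', 'yellow'],
    'pickup_zone': ['Astoria', 'SoHo', 'Gramercy'],
})

# get_dummies is a stand-in for OneHotEncoder here
encoded = pd.get_dummies(toy, dtype=float)

# Fraction of non-zero cells in the encoded matrix
density = float((encoded.to_numpy() != 0).mean())

# Each row holds exactly one 1 per original column, hence this ratio
expected_density = toy.shape[1] / encoded.shape[1]
print(density, expected_density)
```

By the same reasoning, six categorical columns spread over 410 one-hot columns give a density of 6/410, or roughly 1.5% non-zero values.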
3. Is there a dependency between pickup_zone and pickup_borough?
Unique (pickup_zone, pickup_borough) pairs
unique_pairs = df[['pickup_zone', 'pickup_borough']].drop_duplicates()
                 pickup_zone pickup_borough
0            Lenox Hill West      Manhattan
1      Upper West Side South      Manhattan
2              Alphabet City      Manhattan
3                  Hudson Sq      Manhattan
4               Midtown East      Manhattan
...                      ...            ...
6234                Glendale         Queens
6248       Kew Gardens Hills         Queens
6295      Woodlawn/Wakefield          Bronx
6318  West Farms/Bronx River          Bronx
6417       Hillcrest/Pomonok         Queens

[194 rows x 2 columns]
Unique (pickup_zone, pickup_borough) pairs grouped by pickup_zone
result = unique_pairs.groupby('pickup_zone')['pickup_borough'].unique()
pickup_zone
Allerton/Pelham Gardens [Bronx]
Alphabet City [Manhattan]
Astoria [Queens]
Auburndale [Queens]
Battery Park [Manhattan]
...
Woodlawn/Wakefield [Bronx]
Woodside [Queens]
World Trade Center [Manhattan]
Yorkville East [Manhattan]
Yorkville West [Manhattan]
Name: pickup_borough, Length: 194, dtype: object
Number of unique (pickup_zone, pickup_borough) pairs per pickup_zone
result = unique_pairs.groupby('pickup_zone').size()
pickup_zone
Allerton/Pelham Gardens 1
Alphabet City 1
Astoria 1
Auburndale 1
Battery Park 1
..
Woodlawn/Wakefield 1
Woodside 1
World Trade Center 1
Yorkville East 1
Yorkville West 1
Length: 194, dtype: int64
Maximum number of unique pairs for any single pickup_zone
result = unique_pairs.groupby('pickup_zone').size().max()
1
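The three steps above amount to testing whether one column functionally determines another. A compact sketch follows; the helper name `determines` and the toy data are assumptions introduced for illustration.

```python
import pandas as pd

def determines(frame, determinant, dependent):
    """True if each value of `determinant` maps to exactly one value of `dependent`."""
    return bool(frame.groupby(determinant)[dependent].nunique().max() == 1)

# Toy frame (illustrative only)
toy = pd.DataFrame({
    'pickup_zone': ['Astoria', 'SoHo', 'Gramercy', 'SoHo'],
    'pickup_borough': ['Queens', 'Manhattan', 'Manhattan', 'Manhattan'],
})

# A zone lies in a single borough, but a borough contains several zones
zone_to_borough = determines(toy, 'pickup_zone', 'pickup_borough')
borough_to_zone = determines(toy, 'pickup_borough', 'pickup_zone')
print(zone_to_borough, borough_to_zone)
```

The check is directional: a result of 1 for zones grouped by borough would be surprising, since a borough normally spans many zones.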
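Each pickup_zone is paired with exactly one pickup_borough (the maximum above is 1), so the borough is a coarser grouping of the zone. Start by one-hot encoding only the boroughs (not the zones).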
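A minimal sketch of that first step, restricted to the two borough columns. It runs on an invented frame and uses `pandas.get_dummies` rather than the `OneHotEncoder` from the encoding section; both choices are assumptions made to keep the example short.

```python
import pandas as pd

# Toy frame (illustrative only)
toy = pd.DataFrame({
    'pickup_zone': ['Astoria', 'SoHo', 'Gramercy'],
    'pickup_borough': ['Queens', 'Manhattan', 'Manhattan'],
    'dropoff_borough': ['Manhattan', 'Brooklyn', 'Manhattan'],
})

# Encode only the low-cardinality borough columns; zone columns are left out
borough_cols = ['pickup_borough', 'dropoff_borough']
encoded = pd.get_dummies(toy[borough_cols], dtype=float)
print(encoded)
```

On the taxis data this would give 4 + 5 = 9 columns, based on the borough values listed in the dataset analysis, instead of 410.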