Flights - Encoding
1. Dataset analysis
import seaborn as sns
df = sns.load_dataset('flights')  # monthly airline passenger counts, 1949-1960
df.head()
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121
One column is not numeric:
- 'month': month of the flight date. No hierarchy => nominal encoding
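The non-numeric columns can also be found programmatically instead of by eye. A minimal sketch, assuming a small hand-built frame with the same three columns as the dataset (the `sample` name is illustrative; its rows are copied from the `df.head()` output above):

```python
import pandas as pd

# Illustrative sample: the first rows shown by df.head()
sample = pd.DataFrame({
    "year": [1949, 1949, 1949],
    "month": pd.Categorical(["Jan", "Feb", "Mar"]),
    "passengers": [112, 118, 132],
})

# Keep only the columns whose dtype is not numeric
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['month']
```

The same `select_dtypes` call should work on the full `df`, since it only inspects column dtypes.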
Values of the 'year' column
df['year'].unique()
[1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960]
Values of the 'month' column
df['month'].unique()
['Jan', 'Feb', 'Mar', 'Apr', 'May', ..., 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Length: 12
Categories (12, object): ['Jan', 'Feb', 'Mar', 'Apr', ..., 'Sep', 'Oct', 'Nov', 'Dec']
2. Encoding
None of these variables is hierarchical => nominal one-hot encoding
OneHot Encoder - 'Month'
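from sklearn.preprocessing import OneHotEncoder
# Dense output; categories unseen during fit are encoded as all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.set_output(transform='pandas')  # return a DataFrame with named columns
encoder.fit(df[['month']])
encoder.transform(df[['month']])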
     month_Apr  month_Aug  month_Dec  ...  month_Nov  month_Oct  month_Sep
0          0.0        0.0        0.0  ...        0.0        0.0        0.0
1          0.0        0.0        0.0  ...        0.0        0.0        0.0
2          0.0        0.0        0.0  ...        0.0        0.0        0.0
3          1.0        0.0        0.0  ...        0.0        0.0        0.0
4          0.0        0.0        0.0  ...        0.0        0.0        0.0
..         ...        ...        ...  ...        ...        ...        ...
139        0.0        1.0        0.0  ...        0.0        0.0        0.0
140        0.0        0.0        0.0  ...        0.0        0.0        1.0
141        0.0        0.0        0.0  ...        0.0        1.0        0.0
142        0.0        0.0        0.0  ...        1.0        0.0        0.0
143        0.0        0.0        1.0  ...        0.0        0.0        0.0

[144 rows x 12 columns]
/!\ This creates 12 columns
3. Ordinal encoding
Ordinal Encoder - 'Month'
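from sklearn.preprocessing import OrdinalEncoder
# Calendar order of the month labels; each label's position becomes its code
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
encoder = OrdinalEncoder(categories=[month_order]).set_output(transform='pandas')
encoder.fit(df[['month']])
result = encoder.transform(df[['month']])
result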
     month
0      0.0
1      1.0
2      2.0
3      3.0
4      4.0
..     ...
139    7.0
140    8.0
141    9.0
142   10.0
143   11.0

[144 rows x 1 columns]
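To check that the codes follow calendar order, they can be mapped back to labels with `inverse_transform`. A minimal sketch on a hand-built frame (the `months` frame is illustrative and holds only the last five months, not the full dataset):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Illustrative stand-in for the last five rows of df[['month']]
months = pd.DataFrame({"month": ["Aug", "Sep", "Oct", "Nov", "Dec"]})

encoder = OrdinalEncoder(categories=[month_order])
codes = encoder.fit_transform(months)      # position of each label in month_order
labels = encoder.inverse_transform(codes)  # map the codes back to labels

print(codes.ravel().tolist())   # [7.0, 8.0, 9.0, 10.0, 11.0]
print(labels.ravel().tolist())  # ['Aug', 'Sep', 'Oct', 'Nov', 'Dec']
```

The codes 7.0 to 11.0 match rows 139 to 143 of the output above.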