Flights - Encoding




1. Dataset analysis


Dataset analysis
import seaborn as sns

df = sns.load_dataset('flights')
df.head()
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121
One column is not numeric:
- 'month': month of the flight. It has no hierarchy => nominal encoding
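
As a quick check of the observation above, the non-numeric columns can be listed programmatically. This is a minimal sketch on a hypothetical three-row stand-in for the table (the `sample` frame is an assumption, built by hand so the snippet runs without downloading the dataset); the same `select_dtypes` call should apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the flights table.
sample = pd.DataFrame({
    "year": [1949, 1949, 1949],
    "month": ["Jan", "Feb", "Mar"],
    "passengers": [112, 118, 132],
})

# Columns whose dtype is not numeric are the candidates for encoding.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Here only 'month' is reported, which matches the manual reading of `df.head()`.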

Values of the 'year' column
df['year'].unique()
[1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960]

Values of the 'month' column
df['month'].unique()
['Jan', 'Feb', 'Mar', 'Apr', 'May', ..., 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Length: 12
Categories (12, object): ['Jan', 'Feb', 'Mar', 'Apr', ..., 'Sep', 'Oct', 'Nov', 'Dec']



2. Encoding

None of these variables is hierarchical => nominal one-hot encoding


OneHot Encoder - 'Month'
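from sklearn.preprocessing import OneHotEncoder

# set_output makes transform return a DataFrame with named columns (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').set_output(transform='pandas')
encoder.fit(df[['month']])
encoder.transform(df[['month']])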
     month_Apr  month_Aug  month_Dec  ...  month_Nov  month_Oct  month_Sep
0          0.0        0.0        0.0  ...        0.0        0.0        0.0
1          0.0        0.0        0.0  ...        0.0        0.0        0.0
2          0.0        0.0        0.0  ...        0.0        0.0        0.0
3          1.0        0.0        0.0  ...        0.0        0.0        0.0
4          0.0        0.0        0.0  ...        0.0        0.0        0.0
..         ...        ...        ...  ...        ...        ...        ...
139        0.0        1.0        0.0  ...        0.0        0.0        0.0
140        0.0        0.0        0.0  ...        0.0        0.0        1.0
141        0.0        0.0        0.0  ...        0.0        1.0        0.0
142        0.0        0.0        0.0  ...        1.0        0.0        0.0
143        0.0        0.0        1.0  ...        0.0        0.0        0.0

[144 rows x 12 columns]
/!\ This creates 12 columns, one per month



3. Ordinal encoding


Ordinal Encoder - 'Month'
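from sklearn.preprocessing import OrdinalEncoder

# Explicit calendar order, so that Jan -> 0 and Dec -> 11
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
encoder = OrdinalEncoder(categories=[month_order]).set_output(transform='pandas')
encoder.fit(df[['month']])
result = encoder.transform(df[['month']])
result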
     month
0      0.0
1      1.0
2      2.0
3      3.0
4      4.0
..     ...
139    7.0
140    8.0
141    9.0
142   10.0
143   11.0

[144 rows x 1 columns]
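
The explicit `month_order` list matters because, when no categories are given, OrdinalEncoder infers them from the data and sorts string labels alphabetically. A minimal sketch on a hypothetical three-row frame (the `sample` frame and the variable names are assumptions) compares the two behaviours:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical stand-in for df[['month']] with three labels only.
sample = pd.DataFrame({"month": ["Jan", "Apr", "Dec"]})

# Default: categories are inferred and sorted alphabetically (Apr, Dec, Jan).
default_codes = OrdinalEncoder().fit_transform(sample[["month"]]).ravel().tolist()

# Explicit order: codes follow the calendar position of each month.
calendar_codes = (
    OrdinalEncoder(categories=[month_order])
    .fit_transform(sample[["month"]])
    .ravel()
    .tolist()
)

print(default_codes)   # alphabetical codes
print(calendar_codes)  # calendar codes
```

With the default, 'Jan' gets the code 2.0 because it comes after 'Apr' and 'Dec' alphabetically; with the explicit list it gets 0.0, as in the output above.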