Penguins - Encoding




1. Dataset analysis


Dataset analysis
import seaborn as sns

df = sns.load_dataset('penguins')
# dropna() returns a new DataFrame, so the result must be reassigned
df = df.dropna()
df.head()
  species     island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1  ...              181.0       3750.0    Male
1  Adelie  Torgersen            39.5  ...              186.0       3800.0  Female
2  Adelie  Torgersen            40.3  ...              195.0       3250.0  Female
4  Adelie  Torgersen            36.7  ...              193.0       3450.0  Female
5  Adelie  Torgersen            39.3  ...              190.0       3650.0    Male

[5 rows x 7 columns]
Non-numeric columns
Three columns are not numeric: 'species', 'island', 'sex'
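
The non-numeric columns can also be found programmatically rather than by eye. The following is a minimal sketch, assuming a small hand-built DataFrame named `sample` (hypothetical, standing in for the real penguins data) with the same kinds of columns:

```python
import pandas as pd

# Hypothetical five-row sample that mimics part of the penguins schema
# (illustrative values, not loaded from the real dataset)
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Dream", "Dream", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 46.5, 46.1, 50.0],
    "body_mass_g": [3750.0, 3800.0, 3500.0, 4500.0, 5700.0],
    "sex": ["Male", "Female", "Female", "Female", "Male"],
})

# Keep only the columns whose dtype is not numeric
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real dataset, the same `select_dtypes` call applied to `df` should list the three columns named above.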

Values of the 'species' column (the target)
df['species'].unique()
['Adelie' 'Chinstrap' 'Gentoo']

Values of the 'island' column
df['island'].unique()
['Torgersen' 'Biscoe' 'Dream']

Values of the 'sex' column
df['sex'].unique()
['Male' 'Female']



2. Encoding

None of these variables is ordinal (their categories have no natural order), so nominal one-hot encoding is used.


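OneHotEncoder on 'island' and 'sex' (target = 'species')
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

df = sns.load_dataset('penguins')
# dropna() returns a new DataFrame; reassign it to keep only the 333 complete rows
df = df.dropna()

# drop='first' removes one redundant column per feature, and the pandas output
# keeps the row index and column names, matching the table shown below
encoder = OneHotEncoder(sparse_output=False, drop='first').set_output(transform='pandas')
encoder.fit(df[['island', 'sex']])
encoder.transform(df[['island', 'sex']])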
     island_Dream  island_Torgersen  sex_Male
0             0.0               1.0       1.0
1             0.0               1.0       0.0
2             0.0               1.0       0.0
4             0.0               1.0       0.0
5             0.0               1.0       1.0
..            ...               ...       ...
338           0.0               0.0       0.0
340           0.0               0.0       0.0
341           0.0               0.0       1.0
342           0.0               0.0       0.0
343           0.0               0.0       1.0

[333 rows x 3 columns]



3. Result with LabelBinarizer


Target 'species' with LabelBinarizer
from sklearn.preprocessing import LabelBinarizer

# Pass the target as a 1-D Series; the result has one column per class
result = LabelBinarizer().fit_transform(df['species'])[:10]
print(result)
[[1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]]
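
To read such a matrix, it helps to know which column stands for which class. The sketch below assumes a short hypothetical list of species labels and uses the fitted binarizer's `classes_` attribute and `inverse_transform` method to map the columns back to names:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels (illustrative only, not the full target column)
species = ["Adelie", "Chinstrap", "Gentoo", "Adelie"]

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(species)

# classes_ gives the column order; inverse_transform recovers the labels
column_order = binarizer.classes_.tolist()
decoded = binarizer.inverse_transform(encoded).tolist()

print(column_order)
print(encoded)
print(decoded)
```

Since the classes are sorted alphabetically, a row of `[1 0 0]` corresponds to 'Adelie', which is consistent with the first ten rows shown above.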



4. Result with LabelEncoder


Target 'species' with LabelEncoder
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-D target, so pass the Series rather than a one-column DataFrame
result = LabelEncoder().fit_transform(df['species'])
print(result)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
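
The integer codes can be mapped back to species names through the fitted encoder. This is a minimal sketch on a short hypothetical list of labels, using the `classes_` attribute and `inverse_transform` method of `LabelEncoder`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (illustrative only, not the full target column)
species = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# The position of each class in classes_ is its integer code
mapping = {name: index for index, name in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes).tolist()

print(codes)
print(mapping)
print(decoded)
```

Keeping the fitted encoder around (rather than calling `LabelEncoder().fit_transform(...)` inline) is what makes it possible to decode model predictions later.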