Miles per Gallon - OneHot Encoding
1. Dataset analysis
import seaborn as sns

df = sns.load_dataset('mpg')
df = df.dropna()
df.head()
    mpg  cylinders  displacement  ...  model_year  origin                       name
0  18.0          8         307.0  ...          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0  ...          70     usa          buick skylark 320
2  18.0          8         318.0  ...          70     usa         plymouth satellite
3  16.0          8         304.0  ...          70     usa              amc rebel sst
4  17.0          8         302.0  ...          70     usa                ford torino

[5 rows x 9 columns]
Two columns are non-numeric and nominal (unordered):
- 'origin': country of origin of the car model
- 'name': name of the car model
Values of the 'origin' column
df['origin'].unique()
['usa' 'japan' 'europe']
Values of the 'name' column
df['name'].unique()
['chevrolet chevelle malibu' 'buick skylark 320' 'plymouth satellite' 'amc rebel sst' 'ford torino' 'ford galaxie 500' 'chevrolet impala' 'plymouth fury iii' 'pontiac catalina' 'amc ambassador dpl' 'dodge challenger se' "plymouth 'cuda 340" 'chevrolet monte carlo' 'buick estate wagon (sw)' 'toyota corona mark ii' 'plymouth duster' 'amc hornet' 'ford maverick' 'datsun pl510' 'volkswagen 1131 deluxe sedan' 'peugeot 504' 'audi 100 ls' 'saab 99e' 'bmw 2002' 'amc gremlin' 'ford f250' 'chevy c20' 'dodge d200' 'hi 1200d' 'chevrolet vega 2300' 'toyota corona' 'plymouth satellite custom' 'ford torino 500' 'amc matador' 'pontiac catalina brougham' 'dodge monaco (sw)' 'ford country squire (sw)' 'pontiac safari (sw)' 'amc hornet sportabout (sw)' 'chevrolet vega (sw)' 'pontiac firebird' 'ford mustang' 'mercury capri 2000' 'opel 1900' 'peugeot 304' 'fiat 124b' 'toyota corolla 1200' 'datsun 1200' 'volkswagen model 111' 'plymouth cricket' 'toyota corona hardtop' 'dodge colt hardtop' 'volkswagen type 3' 'chevrolet vega' 'ford pinto runabout' 'amc ambassador sst' 'mercury marquis' 'buick lesabre custom' 'oldsmobile delta 88 royale' 'chrysler newport royal' 'mazda rx2 coupe' 'amc matador (sw)' 'chevrolet chevelle concours (sw)' 'ford gran torino (sw)' 'plymouth satellite custom (sw)' 'volvo 145e (sw)' 'volkswagen 411 (sw)' 'peugeot 504 (sw)' 'renault 12 (sw)' 'ford pinto (sw)' 'datsun 510 (sw)' 'toyouta corona mark ii (sw)' 'dodge colt (sw)' 'toyota corolla 1600 (sw)' 'buick century 350' 'chevrolet malibu' 'ford gran torino' 'dodge coronet custom' 'mercury marquis brougham' 'chevrolet caprice classic' 'ford ltd' 'plymouth fury gran sedan' 'chrysler new yorker brougham' 'buick electra 225 custom' 'amc ambassador brougham' 'plymouth valiant' 'chevrolet nova custom' 'volkswagen super beetle' 'ford country' 'plymouth custom suburb' 'oldsmobile vista cruiser' 'toyota carina' 'datsun 610' 'maxda rx3' 'ford pinto' 'mercury capri v6' 'fiat 124 sport coupe' 'chevrolet monte carlo s' 'pontiac 
grand prix' 'fiat 128' 'opel manta' 'audi 100ls' 'volvo 144ea' 'dodge dart custom' 'saab 99le' 'toyota mark ii' 'oldsmobile omega' 'chevrolet nova' 'datsun b210' 'chevrolet chevelle malibu classic' 'plymouth satellite sebring' 'buick century luxus (sw)' 'dodge coronet custom (sw)' 'audi fox' 'volkswagen dasher' 'datsun 710' 'dodge colt' 'fiat 124 tc' 'honda civic' 'subaru' 'fiat x1.9' 'plymouth valiant custom' 'mercury monarch' 'chevrolet bel air' 'plymouth grand fury' 'buick century' 'chevroelt chevelle malibu' 'plymouth fury' 'buick skyhawk' 'chevrolet monza 2+2' 'ford mustang ii' 'toyota corolla' 'pontiac astro' 'volkswagen rabbit' 'amc pacer' 'volvo 244dl' 'honda civic cvcc' 'fiat 131' 'capri ii' 'renault 12tl' 'dodge coronet brougham' 'chevrolet chevette' 'chevrolet woody' 'vw rabbit' 'dodge aspen se' 'ford granada ghia' 'pontiac ventura sj' 'amc pacer d/l' 'datsun b-210' 'volvo 245' 'plymouth volare premier v8' 'mercedes-benz 280s' 'cadillac seville' 'chevy c10' 'ford f108' 'dodge d100' 'honda accord cvcc' 'buick opel isuzu deluxe' 'renault 5 gtl' 'plymouth arrow gs' 'datsun f-10 hatchback' 'oldsmobile cutlass supreme' 'dodge monaco brougham' 'mercury cougar brougham' 'chevrolet concours' 'buick skylark' 'plymouth volare custom' 'ford granada' 'pontiac grand prix lj' 'chevrolet monte carlo landau' 'chrysler cordoba' 'ford thunderbird' 'volkswagen rabbit custom' 'pontiac sunbird coupe' 'toyota corolla liftback' 'ford mustang ii 2+2' 'dodge colt m/m' 'subaru dl' 'datsun 810' 'bmw 320i' 'mazda rx-4' 'volkswagen rabbit custom diesel' 'ford fiesta' 'mazda glc deluxe' 'datsun b210 gx' 'oldsmobile cutlass salon brougham' 'dodge diplomat' 'mercury monarch ghia' 'pontiac phoenix lj' 'ford fairmont (auto)' 'ford fairmont (man)' 'plymouth volare' 'amc concord' 'buick century special' 'mercury zephyr' 'dodge aspen' 'amc concord d/l' 'buick regal sport coupe (turbo)' 'ford futura' 'dodge magnum xe' 'datsun 510' 'dodge omni' 'toyota celica gt liftback' 'plymouth sapporo' 
'oldsmobile starfire sx' 'datsun 200-sx' 'audi 5000' 'volvo 264gl' 'saab 99gle' 'peugeot 604sl' 'volkswagen scirocco' 'honda accord lx' 'pontiac lemans v6' 'mercury zephyr 6' 'ford fairmont 4' 'amc concord dl 6' 'dodge aspen 6' 'ford ltd landau' 'mercury grand marquis' 'dodge st. regis' 'chevrolet malibu classic (sw)' 'chrysler lebaron town @ country (sw)' 'vw rabbit custom' 'maxda glc deluxe' 'dodge colt hatchback custom' 'amc spirit dl' 'mercedes benz 300d' 'cadillac eldorado' 'plymouth horizon' 'plymouth horizon tc3' 'datsun 210' 'fiat strada custom' 'buick skylark limited' 'chevrolet citation' 'oldsmobile omega brougham' 'pontiac phoenix' 'toyota corolla tercel' 'datsun 310' 'ford fairmont' 'audi 4000' 'toyota corona liftback' 'mazda 626' 'datsun 510 hatchback' 'mazda glc' 'vw rabbit c (diesel)' 'vw dasher (diesel)' 'audi 5000s (diesel)' 'mercedes-benz 240d' 'honda civic 1500 gl' 'vokswagen rabbit' 'datsun 280-zx' 'mazda rx-7 gs' 'triumph tr7 coupe' 'honda accord' 'plymouth reliant' 'dodge aries wagon (sw)' 'toyota starlet' 'plymouth champ' 'honda civic 1300' 'datsun 210 mpg' 'toyota tercel' 'mazda glc 4' 'plymouth horizon 4' 'ford escort 4w' 'ford escort 2h' 'volkswagen jetta' 'honda prelude' 'datsun 200sx' 'peugeot 505s turbo diesel' 'volvo diesel' 'toyota cressida' 'datsun 810 maxima' 'oldsmobile cutlass ls' 'ford granada gl' 'chrysler lebaron salon' 'chevrolet cavalier' 'chevrolet cavalier wagon' 'chevrolet cavalier 2-door' 'pontiac j2000 se hatchback' 'dodge aries se' 'ford fairmont futura' 'volkswagen rabbit l' 'mazda glc custom l' 'mazda glc custom' 'plymouth horizon miser' 'mercury lynx l' 'nissan stanza xe' 'honda civic (auto)' 'datsun 310 gx' 'buick century limited' 'oldsmobile cutlass ciera (diesel)' 'chrysler lebaron medallion' 'ford granada l' 'toyota celica gt' 'dodge charger 2.2' 'chevrolet camaro' 'ford mustang gl' 'vw pickup' 'dodge rampage' 'ford ranger' 'chevy s-10']
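Rather than reading through the full list of names, the number of distinct values per categorical column can be counted directly. The sketch below is a minimal illustration on a hypothetical toy DataFrame (the rows and the variable names `toy`, `categorical` and `cardinality` are assumptions, not part of the original notebook); the same two calls should apply to `df`.

```python
import pandas as pd

# Hypothetical toy rows imitating the structure of the 'mpg' dataset
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 27.0],
    "origin": ["usa", "usa", "japan", "europe"],
    "name": ["ford torino", "buick skylark 320", "toyota corona", "peugeot 504"],
})

# Keep only the non-numeric columns, then count distinct values in each
categorical = toy.select_dtypes(exclude="number")
cardinality = categorical.nunique()
print(cardinality)
```

A column with few distinct values (like 'origin') is a comfortable candidate for OneHot encoding, while a high count (like 'name') signals that the encoding will produce many columns.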
2. OneHot encoding
OneHot Encoder - Origin & Name
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['origin', 'name']])
encoder.transform(df[['origin', 'name']])
     origin_europe  ...  name_vw rabbit custom
0              0.0  ...                    0.0
1              0.0  ...                    0.0
2              0.0  ...                    0.0
3              0.0  ...                    0.0
4              0.0  ...                    0.0
..             ...  ...                    ...
393            0.0  ...                    0.0
394            1.0  ...                    0.0
395            0.0  ...                    0.0
396            0.0  ...                    0.0
397            0.0  ...                    0.0

[392 rows x 304 columns]
Sparse matrix: a matrix in which the vast majority of values are zero
sns.heatmap(encoder.transform(df[['origin', 'name']]))
One white dot = one value 1 in the encoding matrix; every other cell is 0 => sparse matrix
/!\ A column with about 300 distinct values becomes about 300 columns!
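To put a number on this sparsity, the share of non-zero cells can be measured. The sketch below is an illustration on hypothetical toy data (the `toy` rows and the `density` variable are assumptions); it relies on the encoder's default behaviour of returning a SciPy sparse matrix, which stores only the non-zero cells.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column: 5 rows, 4 distinct names
toy = pd.DataFrame({
    "name": ["ford torino", "buick skylark 320", "toyota corona",
             "peugeot 504", "ford torino"],
})

# By default OneHotEncoder returns a SciPy sparse matrix
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["name"]])

# Each row holds exactly one 1, so density = 1 / number of columns
n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)
print(encoded.shape, density)
```

With a single encoded column, each row contains exactly one 1, so the density shrinks as the number of distinct values grows. This is why keeping the sparse representation tends to be preferable for high-cardinality columns such as 'name'.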
OneHot Encoder - Origin
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['origin']])
encoder.transform(df[['origin']])
     origin_europe  origin_japan  origin_usa
0              0.0           0.0         1.0
1              0.0           0.0         1.0
2              0.0           0.0         1.0
3              0.0           0.0         1.0
4              0.0           0.0         1.0
..             ...           ...         ...
393            0.0           0.0         1.0
394            1.0           0.0         0.0
395            0.0           0.0         1.0
396            0.0           0.0         1.0
397            0.0           0.0         1.0

[392 rows x 3 columns]
sns.heatmap(encoder.transform(df[['origin']]))
OneHot Encoder with first column dropped (avoids the multicollinearity problem) - Origin
encoder = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
encoder.fit(df[['origin']])
encoder.transform(df[['origin']])
     origin_japan  origin_usa
0             0.0         1.0
1             0.0         1.0
2             0.0         1.0
3             0.0         1.0
4             0.0         1.0
..            ...         ...
393           0.0         1.0
394           0.0         0.0
395           0.0         1.0
396           0.0         1.0
397           0.0         1.0

[392 rows x 2 columns]
sns.heatmap(encoder.transform(df[['origin']]))
3. Encoding the target
Target: the variable to predict in Machine Learning
No need to encode the target: the model takes care of it (scikit-learn classifiers accept string class labels directly)
from sklearn.tree import DecisionTreeClassifier
x = df[['mpg', 'cylinders', 'displacement', 'horsepower']]
y = df['name']
model = DecisionTreeClassifier()
model.fit(x,y)
model.predict([[18.0, 8, 307.0, 130.0]])
['chevrolet chevelle malibu']
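To see what "the model takes care of it" means in practice, the fitted classifier can be inspected. The sketch below uses hypothetical toy data (the arrays `X` and `y` and the choice of 'origin' as target are assumptions made to keep the example small); it shows that the classifier stores the string labels in `classes_`, and that an explicit `LabelEncoder` would produce the same sorted label set.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [mpg, cylinders] -> origin
X = np.array([[18.0, 8], [30.0, 4], [15.0, 8], [32.0, 4]])
y = np.array(["usa", "japan", "usa", "japan"])

# The classifier is fitted directly on string labels
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
prediction = model.predict(np.array([[31.0, 4]]))
print(model.classes_, prediction)

# An explicit LabelEncoder yields the same sorted classes
label_encoder = LabelEncoder().fit(y)
print(label_encoder.classes_)
print(label_encoder.transform(y))
```

The prediction comes back as a string label, so no decoding step is needed afterwards. An explicit `LabelEncoder` is mainly useful with libraries that require integer targets.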