Kaggle - Classification with Bank Churn
1. Projet
L’objectif de ce projet est de prédire si un client va continuer à utiliser les services de la banque ou s’il va clôturer son compte (churn)
2. Analyse des données
lecture du dataset
df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_train_data.csv')
ID CustomerId ... EstimatedSalary Exited 0 37765 15794860 ... 161205.61 0 1 130453 15728005 ... 181419.29 0 2 77297 15686810 ... 100862.54 0 3 40858 15760244 ... 61164.45 1 4 19804 15810563 ... 103737.82 0 ... ... ... ... ... ... 143574 97639 15759915 ... 103349.74 0 143575 95939 15769974 ... 121299.14 0 143576 152315 15592028 ... 57569.89 0 143577 117952 15804009 ... 84496.78 0 143578 43567 15771409 ... 140937.98 1 [143579 rows x 14 columns]
df.head()
ID CustomerId Surname ... IsActiveMember EstimatedSalary Exited 0 37765 15794860 Ch'eng ... 1.0 161205.61 0 1 130453 15728005 Hargreaves ... 1.0 181419.29 0 2 77297 15686810 Ts'ui ... 1.0 100862.54 0 3 40858 15760244 Trevisano ... 0.0 61164.45 1 4 19804 15810563 French ... 1.0 103737.82 0 [5 rows x 14 columns]
df.describe()
ID CustomerId ... EstimatedSalary Exited count 143579.000000 1.435790e+05 ... 143579.000000 143579.000000 mean 82521.171097 1.569202e+07 ... 112530.072465 0.212078 std 47650.353367 7.142049e+04 ... 50301.718378 0.408781 min 0.000000 1.556570e+07 ... 11.580000 0.000000 25% 41259.500000 1.563299e+07 ... 74580.800000 0.000000 50% 82485.000000 1.569018e+07 ... 117931.100000 0.000000 75% 123793.500000 1.575685e+07 ... 155149.685000 0.000000 max 165033.000000 1.581569e+07 ... 199992.480000 1.000000 [8 rows x 11 columns]
Variable continue : 'Balance
Balance : Le solde du compte du client
df['Balance'].describe()
count 143579.000000 mean 55533.640642 std 62822.616346 min 0.000000 25% 0.000000 50% 0.000000 75% 119948.090000 max 250898.090000 Name: Balance, dtype: float64
sns.histplot(data=df, x='Balance')
sns.boxplot(data=df, x='Balance')
fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Répartition du solde du compte du client')
sns.histplot(data=df, x='Balance', ax=ax[0])
sns.boxplot(data=df, x='Balance', ax=ax[1])
Variable continue : 'NumOfProducts
NumOfProducts : Le nombre de produits bancaires utilisés par le client (par exemple, compte d’épargne, carte de crédit)
df['NumOfProducts'].describe()
count 143579.000000 mean 1.553932 std 0.546754 min 1.000000 25% 1.000000 50% 2.000000 75% 2.000000 max 4.000000 Name: NumOfProducts, dtype: float64
sns.histplot(data=df, x='NumOfProducts')
sns.boxplot(data=df, x='NumOfProducts')
fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Répartition du nombre de produits bancaires des clients')
sns.histplot(data=df, x='NumOfProducts', ax=ax[0])
sns.boxplot(data=df, x='NumOfProducts', ax=ax[1])
Variable discrète : 'HasCrCard
HasCrCard : Si le client possède une carte de crédit (1 = oui, 0 = non)
calcul des effectifs
df['HasCrCard'].value_counts(normalize=False, sort=True, ascending=False)
HasCrCard 1.0 108274 0.0 35305 Name: count, dtype: int64
distribution des effectifs
df['HasCrCard'].value_counts(normalize=False, sort=True, ascending=False)
HasCrCard 1.0 0.754107 0.0 0.245893 Name: proportion, dtype: float64
sns.scatterplot(data=df, x='HasCrdCart')
Variable discrète : 'IsActiveMember
IsActiveMember : Si le client est un membre actif (1 = oui, 0 = non)
calcul des effectifs
df['IsActiveMember'].value_counts(normalize=False, sort=True, ascending=False)
IsActiveMember 0.0 72249 1.0 71330 Name: count, dtype: int64
distribution des effectifs
df['IsActiveMember'].value_counts(normalize=False, sort=True, ascending=False)
IsActiveMember 0.0 0.5032 1.0 0.4968 Name: proportion, dtype: float64
sns.scatterplot(data=df, x='HasCrdCart')
Variable continue : 'EstimatedSalary
EstimatedSalary : Le salaire estimé du client
df['EstimatedSalary'].describe()
count 143579.000000 mean 112530.072465 std 50301.718378 min 11.580000 25% 74580.800000 50% 117931.100000 75% 155149.685000 max 199992.480000 Name: EstimatedSalary, dtype: float64
sns.histplot(data=df, x='EstimatedSalary')
sns.boxplot(data=df, x='EstimatedSalary')
fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Répartition des salaires estimés des clients')
sns.histplot(data=df, x='EstimatedSalary', ax=ax[0])
sns.boxplot(data=df, x='EstimatedSalary', ax=ax[1])
Variable discrète : 'Exited
(La target) Exited : Si le client a résilié (1 = oui, 0 = non)
calcul des effectifs
df['Exited'].value_counts(normalize=False, sort=True, ascending=False)
Exited 0 113129 1 30450 Name: count, dtype: int64
distribution des effectifs
df['Exited'].value_counts(normalize=False, sort=True, ascending=False)
Exited 0 0.787922 1 0.212078 Name: proportion, dtype: float64
sns.scatterplot(data=df, x='Exited')
3. Traitement des données
EstimatedSalary: Normalization MinMax
Modèle de Classification KNeighborsClassifier
4. Make predictions
lecture du dataset de test
df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_test_data.csv')
ID CustomerId ... EstimatedSalary Salary_Normalized 0 67897 15585246 ... 91830.75 0.459140 1 163075 15604551 ... 90876.95 0.454370 2 134760 15729040 ... 47777.15 0.238851 3 68707 15792329 ... 82696.84 0.413466 4 3428 15617166 ... 151887.16 0.759450 ... ... ... ... ... ... 21450 24790 15697574 ... 175072.47 0.875388 21451 152608 15682708 ... 156680.71 0.783420 21452 28134 15614215 ... 173599.38 0.868022 21453 123871 15587573 ... 161479.19 0.807415 21454 98510 15598070 ... 62390.59 0.311925 [21455 rows x 14 columns]
Prédictions
df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_test_data.csv')
scaler = MinMaxScaler()
df['Salary_Normalized'] = scaler.fit_transform(df[['EstimatedSalary']])
X_new = df[['HasCrCard', 'IsActiveMember', 'Salary_Normalized']]
predictions = model.predict(X_new)
[0 0 0 ... 0 1 0]
Submission
results = pd.DataFrame(
{
'ID': df['ID'],
'Exited': predictions
}
)
results = results.set_index('ID')
Exited ID 67897 0 163075 0 134760 0 68707 0 3428 0 ... ... 24790 0 152608 0 28134 0 123871 1 98510 0 [21455 rows x 1 columns]