Kaggle - Classification with Bank Churn

1. Project
2. Data analysis
3. Data processing
4. Make predictions

1. Project

The goal of this project is to predict whether a customer will keep using the bank's services or close their account (churn).

https://www.kaggle.com/t/f20a23d66bf24a20a232a97389320fa3

2. Data analysis

Reading the dataset

df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_train_data.csv')

            ID  CustomerId  ... EstimatedSalary  Exited
0        37765    15794860  ...       161205.61       0
1       130453    15728005  ...       181419.29       0
2        77297    15686810  ...       100862.54       0
3        40858    15760244  ...        61164.45       1
4        19804    15810563  ...       103737.82       0
...        ...         ...  ...             ...     ...
143574   97639    15759915  ...       103349.74       0
143575   95939    15769974  ...       121299.14       0
143576  152315    15592028  ...        57569.89       0
143577  117952    15804009  ...        84496.78       0
143578   43567    15771409  ...       140937.98       1

[143579 rows x 14 columns]

df.head()

       ID  CustomerId     Surname  ...  IsActiveMember EstimatedSalary Exited
0   37765    15794860      Ch'eng  ...             1.0       161205.61      0
1  130453    15728005  Hargreaves  ...             1.0       181419.29      0
2   77297    15686810       Ts'ui  ...             1.0       100862.54      0
3   40858    15760244   Trevisano  ...             0.0        61164.45      1
4   19804    15810563      French  ...             1.0       103737.82      0

[5 rows x 14 columns]

df.describe()

                  ID    CustomerId  ...  EstimatedSalary         Exited
count  143579.000000  1.435790e+05  ...    143579.000000  143579.000000
mean    82521.171097  1.569202e+07  ...    112530.072465       0.212078
std     47650.353367  7.142049e+04  ...     50301.718378       0.408781
min         0.000000  1.556570e+07  ...        11.580000       0.000000
25%     41259.500000  1.563299e+07  ...     74580.800000       0.000000
50%     82485.000000  1.569018e+07  ...    117931.100000       0.000000
75%    123793.500000  1.575685e+07  ...    155149.685000       0.000000
max    165033.000000  1.581569e+07  ...    199992.480000       1.000000

[8 rows x 11 columns]

Continuous variable: 'Balance'

Balance: the customer's account balance

df['Balance'].describe()

count    143579.000000
mean      55533.640642
std       62822.616346
min           0.000000
25%           0.000000
50%           0.000000
75%      119948.090000
max      250898.090000
Name: Balance, dtype: float64

sns.histplot(data=df, x='Balance')

sns.boxplot(data=df, x='Balance')

fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Distribution of customer account balances')
sns.histplot(data=df, x='Balance', ax=ax[0])
sns.boxplot(data=df, x='Balance', ax=ax[1])

Discrete variable: 'NumOfProducts'

NumOfProducts: the number of banking products the customer uses (for example, savings account, credit card). It only takes integer values from 1 to 4.

df['NumOfProducts'].describe()

count    143579.000000
mean          1.553932
std           0.546754
min           1.000000
25%           1.000000
50%           2.000000
75%           2.000000
max           4.000000
Name: NumOfProducts, dtype: float64

sns.histplot(data=df, x='NumOfProducts')

sns.boxplot(data=df, x='NumOfProducts')

fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Distribution of the number of banking products per customer')
sns.histplot(data=df, x='NumOfProducts', ax=ax[0])
sns.boxplot(data=df, x='NumOfProducts', ax=ax[1])

Discrete variable: 'HasCrCard'

HasCrCard: whether the customer has a credit card (1 = yes, 0 = no)

Counts

df['HasCrCard'].value_counts(normalize=False, sort=True, ascending=False)

HasCrCard
1.0    108274
0.0     35305
Name: count, dtype: int64

Proportions

df['HasCrCard'].value_counts(normalize=True, sort=True, ascending=False)

HasCrCard
1.0    0.754107
0.0    0.245893
Name: proportion, dtype: float64

sns.countplot(data=df, x='HasCrCard')

Discrete variable: 'IsActiveMember'

IsActiveMember: whether the customer is an active member (1 = yes, 0 = no)

Counts

df['IsActiveMember'].value_counts(normalize=False, sort=True, ascending=False)

IsActiveMember
0.0    72249
1.0    71330
Name: count, dtype: int64

Proportions

df['IsActiveMember'].value_counts(normalize=True, sort=True, ascending=False)

IsActiveMember
0.0    0.5032
1.0    0.4968
Name: proportion, dtype: float64

sns.countplot(data=df, x='IsActiveMember')

Continuous variable: 'EstimatedSalary'

EstimatedSalary: the customer's estimated salary

df['EstimatedSalary'].describe()

count    143579.000000
mean     112530.072465
std       50301.718378
min          11.580000
25%       74580.800000
50%      117931.100000
75%      155149.685000
max      199992.480000
Name: EstimatedSalary, dtype: float64

sns.histplot(data=df, x='EstimatedSalary')

sns.boxplot(data=df, x='EstimatedSalary')

fig, ax = plt.subplots(2, 1, sharex=True)
plt.suptitle('Distribution of customers\' estimated salaries')
sns.histplot(data=df, x='EstimatedSalary', ax=ax[0])
sns.boxplot(data=df, x='EstimatedSalary', ax=ax[1])

Discrete variable: 'Exited'

Exited (the target): whether the customer closed their account (1 = yes, 0 = no)

Counts

df['Exited'].value_counts(normalize=False, sort=True, ascending=False)

Exited
0    113129
1     30450
Name: count, dtype: int64

Proportions

df['Exited'].value_counts(normalize=True, sort=True, ascending=False)

Exited
0    0.787922
1    0.212078
Name: proportion, dtype: float64

sns.countplot(data=df, x='Exited')
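
The target is imbalanced: roughly 79% of customers stayed. As a rough sanity check, the sketch below computes the accuracy of a trivial model that always predicts the majority class, using only the counts printed above. It is meant as a reference point for reading accuracy scores, not as part of the original pipeline.

```python
# Class counts copied from the value_counts() output for 'Exited'
counts = {0: 113129, 1: 30450}

total = sum(counts.values())
# The majority class is the label with the largest count
majority_class = max(counts, key=counts.get)
# Accuracy obtained by always predicting the majority class
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any classifier scoring near 0.79 accuracy on this data is therefore doing little better than always answering "no churn".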

3. Data processing

EstimatedSalary: min-max normalization
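
Min-max normalization rescales a value linearly into the range [0, 1] using the column's minimum and maximum. The sketch below works through the arithmetic by hand; the helper name min_max is hypothetical, and the bounds are assumed to be the EstimatedSalary minimum and maximum from the training describe() output.

```python
def min_max(x, x_min, x_max):
    """Rescale x linearly so that x_min maps to 0 and x_max maps to 1."""
    return (x - x_min) / (x_max - x_min)

# Minimum and maximum of EstimatedSalary, taken from the training describe() output
salary_min, salary_max = 11.58, 199992.48

# Example salary taken from the first row of the test output in section 4
scaled = min_max(91830.75, salary_min, salary_max)
print(round(scaled, 6))
```

With these bounds the result is about 0.4591, close to the Salary_Normalized value listed next to that salary in section 4.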

Classification model: KNeighborsClassifier
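
The training code is not shown here, so the following is only a minimal sketch of how such a model could be fitted with scikit-learn. The feature list is assumed from the prediction step in section 4, the toy rows are made up for illustration, and n_neighbors=3 is an arbitrary choice, not the author's setting.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the training set (made-up values, for illustration only)
train = pd.DataFrame({
    'HasCrCard': [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    'IsActiveMember': [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    'EstimatedSalary': [150000.0, 180000.0, 100000.0, 60000.0, 105000.0, 85000.0],
    'Exited': [0, 0, 0, 1, 0, 1],
})

# Scale the salary into [0, 1] so it is comparable to the two binary features
scaler = MinMaxScaler()
train['Salary_Normalized'] = scaler.fit_transform(train[['EstimatedSalary']])

# Assumed feature set (same columns as X_new in the prediction step)
features = ['HasCrCard', 'IsActiveMember', 'Salary_Normalized']

# n_neighbors=3 is an arbitrary illustrative value
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train[features], train['Exited'])

preds = model.predict(train[features])
print(preds)
```

Scaling matters for k-nearest neighbors because distances are computed across all features: an unscaled salary in the tens of thousands would dominate the two 0/1 columns.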

4. Make predictions

Reading the test dataset

df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_test_data.csv')

           ID  CustomerId  ... EstimatedSalary  Salary_Normalized
0       67897    15585246  ...        91830.75           0.459140
1      163075    15604551  ...        90876.95           0.454370
2      134760    15729040  ...        47777.15           0.238851
3       68707    15792329  ...        82696.84           0.413466
4        3428    15617166  ...       151887.16           0.759450
...       ...         ...  ...             ...                ...
21450   24790    15697574  ...       175072.47           0.875388
21451  152608    15682708  ...       156680.71           0.783420
21452   28134    15614215  ...       173599.38           0.868022
21453  123871    15587573  ...       161479.19           0.807415
21454   98510    15598070  ...        62390.59           0.311925

[21455 rows x 14 columns]

Predictions

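from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv(f'{os.path.dirname(__file__)}/kaggle/bank_churn_test_data.csv')
# Note: this refits the scaler on the test data; reusing the scaler fitted on
# the training data would keep the scaling identical to what the model saw.
scaler = MinMaxScaler()
df['Salary_Normalized'] = scaler.fit_transform(df[['EstimatedSalary']])
# Features passed to the model
X_new = df[['HasCrCard', 'IsActiveMember', 'Salary_Normalized']]
predictions = model.predict(X_new)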

[0 0 0 ... 0 1 0]

Submission

results = pd.DataFrame(
    {
        'ID': df['ID'],
        'Exited': predictions
    }
)
results = results.set_index('ID')

        Exited
ID            
67897        0
163075       0
134760       0
68707        0
3428         0
...        ...
24790        0
152608       0
28134        0
123871       1
98510        0

[21455 rows x 1 columns]
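
To upload the predictions, the results frame still has to be written to a CSV file. The sketch below shows what that output looks like, assuming the competition expects an ID column followed by an Exited column. The three rows are a stand-in copied from the output above, and 'submission.csv' is a hypothetical file name.

```python
import pandas as pd

# Stand-in for the `results` frame built above (first three rows of its output)
results = pd.DataFrame(
    {
        'ID': [67897, 163075, 134760],
        'Exited': [0, 0, 0]
    }
).set_index('ID')

# With no path, to_csv() returns the CSV text; the index (ID) becomes the first column
csv_text = results.to_csv()
print(csv_text)

# To write the file instead (hypothetical file name):
# results.to_csv('submission.csv')
```

Because ID is the index, it is written as the first column automatically, so there is no need to pass index=False or reset the index first.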