Churn Training and Testing Data

timmchong·January 10, 2025

Description

The final training and test sets that we used to create our Logistic Classifier for predicting whether or not a customer churns.

Summary


GitHub Repository: https://github.com/ZacharySoo01/I310D_FinalProject

Telco Dataset: https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset

Collection:

We used the Telco Customer Churn: IBM dataset (linked above) as our original data source.

Processing:

We first dropped these columns:

  • CustomerID, as it is just an identifier
  • Count, as every value was "1"
  • The location columns (Country, State, Zip Code, Lat Long, Latitude, and Longitude), as they contain sensitive data
  • Churn Label, Churn Score, CLTV, and Churn Reason, so that we could focus on the label, Churn Value

We then checked for and removed duplicate rows in our dataset.
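
A minimal pandas sketch of these first two processing steps; the file name, DataFrame name, and exact column labels are assumptions based on the Kaggle dataset rather than the project's actual code:

    import pandas as pd

    # Load the original Telco churn data (the file name is an assumption)
    df = pd.read_excel("Telco_customer_churn.xlsx")

    # Drop the identifier, constant, location, and churn-outcome columns,
    # keeping Churn Value as the label
    drop_cols = [
        "CustomerID", "Count",
        "Country", "State", "Zip Code", "Lat Long", "Latitude", "Longitude",
        "Churn Label", "Churn Score", "CLTV", "Churn Reason",
    ]
    df = df.drop(columns=drop_cols)

    # Remove duplicate rows
    df = df.drop_duplicates()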

Afterward, we transformed our qualitative variables into quantitative ones (in the form of booleans) using a user-defined function, transform.
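
The transform function itself lives in the GitHub repository; purely as an illustration (not the project's actual code), a mapping of the Yes/No-style answers to booleans could look like this:

    # Illustrative sketch only; the real transform function is in the repo.
    # Maps Yes/No-style answers to booleans and leaves other columns alone.
    def transform(column):
        mapping = {"Yes": True, "No": False,
                   "No internet service": False, "No phone service": False}
        if column.dtype == object:
            return column.map(mapping)
        return column

    # Applied column by column; multi-category columns such as Contract would
    # need their own numeric encoding, which is omitted here.
    df = df.apply(transform)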

Then we dropped all of the null values in our dataset; additionally, since the transform function left our boolean columns as floats, we converted them to ints.

We double-checked that there were no duplicates or null values left.
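
In pandas, that cleanup and the follow-up check could look roughly like this (continuing from the sketches above; how the flag columns are identified is an assumption):

    # Drop rows with missing values
    df = df.dropna()

    # The transformed flag columns came back as floats (0.0/1.0); cast them to int
    flag_cols = [c for c in df.columns if set(df[c].unique()) <= {0.0, 1.0}]
    df[flag_cols] = df[flag_cols].astype(int)

    # Double-check that no duplicates or nulls remain
    assert not df.duplicated().any()
    assert not df.isnull().any().any()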

We then performed bivariate analysis to determine which features should be used in the model. We did this by comparing each column (feature) against the label (Churn Value) using a linear regression test for significance (see the sketch after this list). From that, we found that we could drop these columns:

  • Monthly Charges
  • Gender
  • Phone Service
  • Multiple Lines
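
One way to run that per-feature significance check, sketched here with scipy's linregress (the exact test and the 0.05 threshold are assumptions; the project's notebook may differ):

    from scipy import stats

    # p-value of the regression slope for each candidate feature vs. Churn Value
    label = df["Churn Value"]
    for col in df.columns.drop("Churn Value"):
        result = stats.linregress(df[col], label)
        verdict = "drop" if result.pvalue > 0.05 else "keep"
        print(f"{col}: p = {result.pvalue:.4f} ({verdict})")

    # The columns found not to be significant were removed
    df = df.drop(columns=["Monthly Charges", "Gender", "Phone Service", "Multiple Lines"])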

We then used train_test_split from scikit-learn to split our dataset into a training set and a test set. Our test size was 20% of the dataset, and the random_state was 42.
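
That split, roughly as it would be written with scikit-learn (variable names are placeholders):

    from sklearn.model_selection import train_test_split

    # Separate the features from the label
    X = df.drop(columns=["Churn Value"])
    y = df["Churn Value"]

    # 80/20 split with the reported random seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )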

Analysis:

We then used the training and test data to create a Logistic Classifier that predicts whether or not a customer will churn.
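
A minimal sketch of that step with scikit-learn's LogisticRegression, continuing from the split above (max_iter is an assumption, not necessarily what the repository uses):

    from sklearn.linear_model import LogisticRegression

    # Fit the logistic regression classifier on the training set
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Predict churn on the held-out test set
    y_pred = clf.predict(X_test)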

Our classifier's accuracy, precision, recall, and F1 score were 76%, 0.62, 0.68, and 0.65, respectively.
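
Those values can be computed with scikit-learn's metric functions, continuing from the predictions above:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:   ", recall_score(y_test, y_pred))
    print("F1 score: ", f1_score(y_test, y_pred))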

We had an AUC of about 0.73, meaning that our model could somewhat distinguish positive cases from negative cases. We also created a confusion matrix, which showed 520 true negatives, 214 true positives, 129 false negatives, and 102 false positives.
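
A sketch of how the AUC and confusion matrix might be produced with scikit-learn, again continuing from the fitted classifier:

    from sklearn.metrics import roc_auc_score, confusion_matrix

    # AUC is computed from the predicted churn probabilities, not the hard labels
    y_prob = clf.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, y_prob))

    # Confusion matrix; scikit-learn's layout is [[TN, FP], [FN, TP]]
    print(confusion_matrix(y_test, y_pred))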

Using LIME on a random data point in our test set, we could see which features led the classifier to predict whether or not a customer would churn (a sketch of the call follows the list below). For this specific example, the features that led the model to predict that the customer would churn were:

  • Tenure Months
  • Dependents
  • Internet Service
  • Online Security
  • Tech Support
  • Contract
  • Online Backup
  • Streaming Movies
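
A rough sketch of that LIME call, assuming the lime package's LimeTabularExplainer and the objects from the sketches above (the class names and the number of features shown are assumptions):

    from lime.lime_tabular import LimeTabularExplainer

    # Build a tabular explainer from the training data
    explainer = LimeTabularExplainer(
        X_train.values,
        feature_names=X_train.columns.tolist(),
        class_names=["No Churn", "Churn"],
        mode="classification",
    )

    # Explain one randomly chosen row from the test set
    row = X_test.sample(1, random_state=0)
    explanation = explainer.explain_instance(
        row.values[0], clf.predict_proba, num_features=8
    )
    print(explanation.as_list())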

Basic info

Author: timmchong
Shared with: Everyone
Created: April 25, 2023
Size: 224 KB
License: CC-0
Dictionary: 4 tables