Churn Training and Testing Data
Description
The final training and test sets that we used to build our Logistic Classifier for predicting whether or not a customer churns.
Summary
GitHub Repository: https://github.com/ZacharySoo01/I310D_FinalProject
Telco Dataset: https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset
Collection:
Our original data comes from the Telco customer churn: IBM dataset on Kaggle (linked above).
Processing:
We first dropped these columns:
- CustomerID as it is just an identifier
- Count as all values were only "1"
- Location columns (Country, State, Zip Code, Lat Long, Latitude, and Longitude) as they contain sensitive location data
- Churn Label, Churn Score, CLTV, and Churn Reason, since they duplicate or leak information about our label, Churn Value
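The column-dropping step above can be sketched with pandas as follows. The tiny DataFrame here is a stand-in for the Kaggle CSV, and the exact column names in the notebook may differ:

```python
import pandas as pd

# Stand-in for the Telco CSV with a few of the columns described above.
df = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B"],
    "Count": [1, 1],
    "Country": ["United States", "United States"],
    "Tenure Months": [12, 3],
    "Churn Value": [0, 1],
})

# Identifier, constant, and sensitive-location columns to remove.
drop_cols = ["CustomerID", "Count", "Country"]
df = df.drop(columns=drop_cols)

print(list(df.columns))  # ['Tenure Months', 'Churn Value']
```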
We then checked for duplicate rows and removed them from our dataset.
Afterward, we converted our qualitative variables into quantitative ones (booleans) using a custom function, transform.
We then dropped all rows with null values from our dataset; additionally, because transform produced the booleans as floats, we converted them to ints.
We double-checked that there were no duplicates or null values left.
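A minimal sketch of the cleaning steps above; the transform helper here is a hypothetical stand-in for the notebook's function, which simply maps Yes/No answers to 1/0:

```python
import pandas as pd

df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", None],
    "Tenure Months": [12, 3, 12, 5],
    "Churn Value": [0, 1, 0, 1],
})

# Remove duplicate rows.
df = df.drop_duplicates()

def transform(series):
    # Map Yes/No strings to 1.0/0.0; unmapped values become NaN (a float).
    return series.map({"Yes": 1, "No": 0}).astype(float)

df["Partner"] = transform(df["Partner"])

# Drop null values, then cast the float booleans back to ints.
df = df.dropna()
df["Partner"] = df["Partner"].astype(int)

# Double-check: no duplicates or nulls remain.
assert df.duplicated().sum() == 0
assert df.isna().sum().sum() == 0
```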
We then performed bivariate analysis to determine which features should be used in the model, comparing each feature column against the label (Churn Value) with a linear-regression test for significance. Based on those results, we found that we could drop these columns:
- Monthly Charges
- Gender
- Phone Service
- Multiple Lines
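One way to run such a per-feature significance check is with scipy's linregress, which reports a p-value for each feature's slope against the label; this is a sketch on synthetic data, and the notebook's exact procedure may differ:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
churn = rng.integers(0, 2, size=200)            # stand-in 0/1 label
informative = churn + rng.normal(0, 0.5, 200)   # feature correlated with churn
noise = rng.normal(0, 1, 200)                   # unrelated feature

for name, feature in [("informative", informative), ("noise", noise)]:
    result = linregress(feature, churn.astype(float))
    keep = result.pvalue < 0.05  # keep features whose slope is significant
    print(f"{name}: p-value={result.pvalue:.3g}, keep={keep}")
```

Features whose p-value exceeds the significance threshold (here 0.05) would be dropped, as with the four columns listed above.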
We then used train_test_split from scikit-learn to split our dataset into a training set and a test set, with a test size of 20% of the dataset and random_state set to 42.
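The split step looks like the following; the arrays here are stand-ins for our processed features and label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # stand-in feature matrix (50 rows)
y = np.arange(50) % 2              # stand-in 0/1 labels

# 20% of rows go to the test set; random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 40 10
```

Fixing random_state means anyone re-running the notebook gets the identical train/test split.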
Analysis:
We then used the training and test data to fit a Logistic Classifier that predicts whether or not a customer will churn.
Our classifier achieved an accuracy of 0.76, a precision of 0.62, a recall of 0.68, and an F1 score of 0.65.
It had an AUC of about 0.73, meaning that our model could somewhat distinguish positive cases from negative ones. Its confusion matrix contained 520 true negatives, 214 true positives, 129 false negatives, and 102 false positives.
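The fit-and-evaluate step can be sketched with scikit-learn as below. The synthetic data stands in for our processed churn features, so the printed metrics will not match the numbers reported above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed churn dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy: ", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("f1:       ", f1_score(y_test, pred))
# AUC uses the predicted probability of the positive (churn) class.
print("auc:      ", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```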
Using LIME on a random data point in our test set, we could see which features led the classifier to its prediction. For this specific example, the features that pushed the model toward predicting that the customer would churn were:
- Tenure Months
- Dependents
- Internet Service
- Online Security
- Tech Support
- Contract
- Online Backup
- Streaming Movies