Background
Have you ever encountered this problem: your company invests substantial resources in acquiring new customers, yet consistently loses existing ones without noticing? As a data analyst, I deeply understand the importance of customer churn prediction for businesses. Today, I'd like to share how to build a complete customer churn prediction system using Python.
Preparation
Before we begin, we need to make thorough preparations. First, we need to understand that customer churn prediction is essentially a classification problem - we need to predict whether customers will churn. This requires us to learn from historical data and identify key factors that lead to customer churn.
Let's first import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
Data Acquisition
Data acquisition is the foundation of the entire analysis process. In practice, I've found that data is often scattered across different systems and needs to be integrated. For example:
- Basic customer information (possibly from CRM systems)
- Transaction records (from order systems)
- Customer service records (from customer service systems)
- User behavior data (from websites or apps)
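In practice these sources are joined on a shared customer key before any modeling happens. Below is a minimal integration sketch; the frames crm_df, orders_df, and support_df and their column names are hypothetical stand-ins for real system exports:

def integrate_sources(crm_df, orders_df, support_df):
    # Aggregate transactions to one row per customer
    order_stats = orders_df.groupby('customer_id').agg(
        total_charges=('amount', 'sum'),
        order_count=('amount', 'count')
    ).reset_index()
    # Count customer service tickets per customer
    ticket_counts = (support_df.groupby('customer_id').size()
                     .rename('ticket_count').reset_index())
    # Left-join everything onto the CRM master table
    merged = crm_df.merge(order_stats, on='customer_id', how='left')
    merged = merged.merge(ticket_counts, on='customer_id', how='left')
    # Customers with no orders or tickets get zeros rather than NaN
    return merged.fillna({'total_charges': 0, 'order_count': 0, 'ticket_count': 0})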
For this walkthrough, let's generate a single synthetic table to stand in for the integrated result:
def load_customer_data():
    # Assume this data is read from a database
    customer_data = pd.DataFrame({
        'customer_id': range(1000),
        'age': np.random.randint(18, 70, 1000),
        'tenure': np.random.randint(0, 60, 1000),
        'monthly_charges': np.random.uniform(30, 150, 1000),
        'total_charges': np.random.uniform(100, 5000, 1000),
        'gender': np.random.choice(['Male', 'Female'], 1000),
        'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], 1000),
        'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], 1000),
        'churn': np.random.choice([0, 1], 1000, p=[0.7, 0.3])
    })
    return customer_data
data = load_customer_data()
Cleaning
Data cleaning might be the most time-consuming step, but it's absolutely crucial. I often say, "Garbage in, garbage out." A good model first requires high-quality data support.
def clean_data(df):
    # Check missing values before doing anything else
    print("Missing values:")
    print(df.isnull().sum())
    numeric_columns = ['age', 'tenure', 'monthly_charges', 'total_charges']
    # Fill numeric gaps with column medians (the synthetic data has none,
    # but real exports usually will)
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    # Cap outliers using the 1.5 * IQR rule
    for col in numeric_columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = df[col].clip(lower_bound, upper_bound)
    # Convert categorical variables to the categorical dtype
    categorical_columns = ['gender', 'contract_type', 'payment_method']
    for col in categorical_columns:
        df[col] = pd.Categorical(df[col])
    return df
data = clean_data(data)
Exploration
Exploratory Data Analysis (EDA) is my favorite part. Through visualization and statistical analysis, we can discover hidden patterns and regularities in the data. These findings often bring direct value to the business.
def explore_data(df):
    # Distribution of numerical variables by churn status
    numeric_cols = ['age', 'tenure', 'monthly_charges', 'total_charges']
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    for i, col in enumerate(numeric_cols):
        sns.boxplot(x='churn', y=col, data=df, ax=axes[i // 2, i % 2])
        axes[i // 2, i % 2].set_title(f'{col} by Churn Status')
    plt.tight_layout()
    plt.show()
    # Churn rate by categorical variable
    categorical_cols = ['gender', 'contract_type', 'payment_method']
    fig, axes = plt.subplots(1, 3, figsize=(20, 5))
    for i, col in enumerate(categorical_cols):
        df_grouped = df.groupby(col)['churn'].mean().sort_values(ascending=False)
        df_grouped.plot(kind='bar', ax=axes[i])
        axes[i].set_title(f'Churn Rate by {col}')
        axes[i].set_ylabel('Churn Rate')
    plt.tight_layout()
    plt.show()
explore_data(data)
Modeling
Model selection and training is the most technically demanding part. We need to choose appropriate models based on data characteristics and business requirements. In this example, I chose the Random Forest algorithm because it has the following advantages:
- Can handle both numerical and categorical features
- Doesn't require feature scaling
- Can evaluate feature importance
- Less prone to overfitting
def prepare_features(df):
    # One-hot encode categorical variables
    categorical_cols = ['gender', 'contract_type', 'payment_method']
    df_encoded = pd.get_dummies(df, columns=categorical_cols)
    # Separate features and target variable
    X = df_encoded.drop(['churn', 'customer_id'], axis=1)
    y = df_encoded['churn']
    return train_test_split(X, y, test_size=0.2, random_state=42)

def train_model(X_train, X_test, y_train, y_test):
    # Train a Random Forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # Evaluate model performance on the held-out test set
    y_pred = model.predict(X_test)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    # Feature importance analysis
    feature_importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    return model, feature_importance
X_train, X_test, y_train, y_test = prepare_features(data)
model, feature_importance = train_model(X_train, X_test, y_train, y_test)
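The feature_importance frame returned above deserves a look before we move on; a few lines of matplotlib turn it into a readable chart:

# Plot the top features driving the model's predictions
top_features = feature_importance.head(10)
plt.figure(figsize=(10, 6))
plt.barh(top_features['feature'][::-1], top_features['importance'][::-1])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()
plt.show()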
Application
Once the model is trained, how do we put it to work in the business? Here I'll share some practical experience:
- Model Deployment: Package the model as an API service for other systems to call.
def predict_churn_probability(customer_data, model, feature_columns):
    # Preprocess new data the same way as the training data
    processed = clean_data(customer_data.copy())
    categorical_cols = ['gender', 'contract_type', 'payment_method']
    X = pd.get_dummies(processed, columns=categorical_cols)
    # Drop non-feature columns if present, then align with the training
    # feature matrix so one-hot columns match the model's expectations
    X = X.drop(['churn', 'customer_id'], axis=1, errors='ignore')
    X = X.reindex(columns=feature_columns, fill_value=0)
    # Predict churn probability
    churn_prob = model.predict_proba(X)[:, 1]
    # Return results
    results = pd.DataFrame({
        'customer_id': customer_data['customer_id'],
        'churn_probability': churn_prob
    })
    return results

new_customers = load_customer_data()  # Assume this is new customer data
predictions = predict_churn_probability(new_customers, model, X_train.columns)
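On the deployment point above: a common pattern is to wrap the prediction function in a small web service. Here is a minimal sketch using Flask; the framework choice and the /predict route are my assumptions, not a fixed part of the pipeline:

# Minimal deployment sketch (Flask and the route name are assumptions)
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON array of customer records with the same fields as the training data
    payload = pd.DataFrame(request.get_json())
    results = predict_churn_probability(payload, model, X_train.columns)
    return jsonify(results.to_dict(orient='records'))

if __name__ == '__main__':
    app.run(port=5000)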
Reflections
After this project, I have the following thoughts and suggestions:
- Data Quality is Crucial
  - Ensure data completeness and accuracy
  - Establish data quality monitoring mechanisms
  - Regularly update and maintain data pipelines
- Feature Engineering is Key
  - Develop a deep understanding of the business
  - Create discriminative features
  - Pay attention to feature correlation
- Model Selection Should Be Moderate
  - Don't blindly pursue complex models
  - Balance model performance and interpretability
  - Consider practical deployment and maintenance costs
- Continuous Optimization is Important
  - Regularly evaluate model performance
  - Collect user feedback
  - Adjust prediction strategies in a timely manner
Finally, I want to say that customer churn prediction is not just a technical problem, but also a business problem. We need to transform technical output into business value - that's the ultimate goal of data analysis.
What do you think about this analysis process? Feel free to share your thoughts and experiences in the comments.
Appendix
Here I've compiled some common problems and solutions:
- Data Imbalance Issues (see the sketch after this list)
  - Use oversampling/undersampling
  - Adjust class weights
  - Use ensemble learning methods
- Feature Selection Strategies
  - Based on correlation analysis
  - Based on feature importance
  - Based on domain knowledge
- Model Tuning Tips (also illustrated in the sketch below)
  - Grid search for optimal parameters
  - Cross-validation to avoid overfitting
  - Feature importance analysis
- Performance Optimization Solutions
  - Feature pre-computation
  - Model compression
  - Batch prediction
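To make the imbalance and tuning points concrete, here is a small sketch combining class weighting with a cross-validated grid search; the parameter grid values are illustrative, not tuned recommendations:

from sklearn.model_selection import GridSearchCV

# class_weight='balanced' reweights classes inversely to their frequency,
# a simple first response to imbalanced churn labels
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation guards against overfitting
    scoring='f1',   # F1 is more informative than accuracy on imbalanced data
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV F1 score:", search.best_score_)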
Remember, there's no perfect solution, only the solution that best fits your business scenario. Continuous learning and practice are key to improving analytical capabilities.