Crunching Hashed Data to Build a Propensity Model in Python

Mar 21, 2023

Unlock the power of hashed data for predictive analytics with Python

Introduction

In the era of data-driven decision-making, propensity models are becoming an invaluable tool for marketers and businesses alike. They enable organizations to predict the likelihood of a customer taking a specific action, such as making a purchase, by analyzing historical data.

In this blog post, we will discuss how to use hashed data to build a propensity model in Python. Hashing is a common technique for protecting sensitive information while preserving its utility for analytics: identical values map to identical hashes, so records can still be grouped, joined, and counted even though the original values are hidden. We’ll walk you through the process step by step, from data preparation to model evaluation.
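
To make the idea concrete, here is a minimal sketch of how a hashed column keeps its analytical value. The salt, the helper name `hash_value`, and the sample emails are illustrative assumptions, not part of any particular dataset.

```python
import hashlib

# Hypothetical salt for illustration; a real one would be kept secret
# and stored outside the code.
SALT = "example-salt"


def hash_value(value: str, salt: str = SALT) -> str:
    """Return a salted SHA-256 hex digest of a string value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


emails = ["ana@example.com", "bob@example.com", "ana@example.com"]
hashed_emails = [hash_value(e) for e in emails]

# Identical inputs produce identical digests, so grouping and joining still work.
print(hashed_emails[0] == hashed_emails[2])
# Different inputs produce different digests.
print(hashed_emails[0] == hashed_emails[1])
```

The digests reveal nothing about ordering or magnitude, which is why hashed columns are best treated as opaque categorical identifiers.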

Step 1: Data Preparation

To build a propensity model, we first need to prepare our hashed data. The data should include features relevant to predicting customer behavior, such as demographics, transaction history, and engagement metrics.

Importing Libraries and Loading Data

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Load your hashed dataset
data = pd.read_csv('hashed_data.csv')

Feature Engineering

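Before building the model, it’s essential to engineer relevant features from the data. Note that a truly hashed date cannot be converted back into a date, so features such as recency or frequency of actions are only available if the timestamp column was left unhashed or the features were computed before hashing. The example below assumes that, despite its name, hashed_transaction_date stores Unix timestamps in seconds.

# Example: Calculate the recency of a customer's last transaction
# Assumes 'hashed_transaction_date' holds Unix epoch seconds, not a hash digest
data['transaction_date'] = pd.to_datetime(data['hashed_transaction_date'], unit='s')
# Days between each transaction and the most recent transaction in the dataset
data['recency'] = (data['transaction_date'].max() - data['transaction_date']).dt.days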
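
Frequency can be derived in a similar way when the data contains one row per transaction. The sketch below uses a toy transaction log; the column name `hashed_customer_id` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy transaction log; 'hashed_customer_id' is a hypothetical column name.
transactions = pd.DataFrame({
    "hashed_customer_id": ["a1f3", "a1f3", "9c2e", "a1f3", "9c2e", "77bd"],
    "transaction_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-02-11",
        "2023-03-01", "2023-03-15", "2023-01-20",
    ]),
})

# Use the latest transaction in the log as the reference point.
reference_date = transactions["transaction_date"].max()

# One row per customer: number of transactions and date of the latest one.
customer_features = (
    transactions.groupby("hashed_customer_id")["transaction_date"]
    .agg(frequency="count", last_transaction="max")
    .reset_index()
)

# Days since each customer's most recent transaction.
customer_features["recency"] = (
    reference_date - customer_features["last_transaction"]
).dt.days

print(customer_features)
```

Grouping works on the hashed identifier because equal customers share equal hashes, even though the identifier itself is unreadable.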

Encoding Categorical Variables

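For categorical variables, we’ll use label encoding to convert hashed values into integers. The integer codes carry an arbitrary order, which tree-based models such as random forests tolerate but linear models may misread. We keep one fitted encoder per column so the same mapping can be reused when scoring new customers.

# One encoder per column, kept so the mapping can be reused at prediction time
encoders = {}
categorical_columns = ['hashed_gender', 'hashed_country', 'hashed_source']

for column in categorical_columns:
    encoders[column] = LabelEncoder()
    data[column + '_encoded'] = encoders[column].fit_transform(data[column])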
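
One practical wrinkle is that new data may contain hash values the encoder never saw during fitting, which LabelEncoder rejects with an error. As an alternative sketch, not the approach used in the rest of this post, scikit-learn's OrdinalEncoder can map unseen categories to a sentinel value. The hash strings below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

categorical_columns = ["hashed_gender", "hashed_country", "hashed_source"]

# Toy training rows with made-up hash strings.
train = pd.DataFrame({
    "hashed_gender": ["g_a", "g_b", "g_a"],
    "hashed_country": ["c_x", "c_y", "c_z"],
    "hashed_source": ["s_1", "s_1", "s_2"],
})

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[categorical_columns])

new_rows = pd.DataFrame({
    "hashed_gender": ["g_b"],
    "hashed_country": ["c_new"],  # not seen during fitting
    "hashed_source": ["s_2"],
})

encoded = encoder.transform(new_rows[categorical_columns])
print(encoded)
```

The unseen country is encoded as -1, so the model can still score the row instead of failing.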

Splitting the Dataset

Now that our data is prepared, we can split it into training and testing sets. This allows us to evaluate our propensity model’s performance.

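# Define the target variable (e.g., whether the customer made a purchase)
data['target'] = (data['hashed_purchase'] == 'hashed_purchase_yes').astype(int)

# Use only the numeric features; the raw hashed strings, the datetime column,
# and 'hashed_purchase' (which would leak the target) are excluded
feature_columns = ['hashed_gender_encoded', 'hashed_country_encoded',
                   'hashed_source_encoded', 'recency']
X = data[feature_columns]
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)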
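
Purchases are often rare, so a purely random split can leave the test set with a different share of buyers than the training set. The following sketch shows a stratified split on synthetic data; the 10% positive rate and the single `recency` feature are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in data: one feature and a target with roughly 10% positives.
X = pd.DataFrame({"recency": rng.integers(0, 365, size=n)})
y = pd.Series((rng.random(n) < 0.1).astype(int), name="target")

# stratify=y keeps the share of positives nearly identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Positive rate overall: {y.mean():.3f}")
print(f"Positive rate train:   {y_train.mean():.3f}")
print(f"Positive rate test:    {y_test.mean():.3f}")
```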

Step 2: Building the Propensity Model

With our data prepared, we can now build a propensity model using a machine learning algorithm. In this example, we’ll use a RandomForestClassifier, but you can experiment with other algorithms to find the best fit for your data.

# Initialize the RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)[:, 1]
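
For a propensity model, the probabilities are usually more useful than the hard predictions, because the cutoff can then be chosen to suit the campaign. Here is a small sketch on synthetic data; the dataset and the 0.3 threshold are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Probability of the positive class for each test row.
scores = clf.predict_proba(X_test)[:, 1]

# Hypothetical business threshold: contact anyone with at least a 30% propensity.
threshold = 0.3
flagged = scores >= threshold

print(f"Customers flagged at 0.5: {(scores >= 0.5).sum()}")
print(f"Customers flagged at {threshold}: {flagged.sum()}")
```

Lowering the threshold trades precision for reach, which can be reasonable when contacting a customer is cheap.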

Step 3: Model Evaluation

To evaluate our propensity model, we’ll use classification metrics such as precision, recall, and F1-score, which are computed from the hard class predictions. Additionally, we’ll compute the ROC-AUC score from the predicted probabilities to measure how well the model ranks positive cases above negative ones.

# Calculate classification metrics
print(classification_report(y_test, y_pred))

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.3f}")
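
Marketers often read propensity scores through a lift table, which compares the conversion rate in each score decile with the overall rate. The sketch below uses synthetic stand-ins for y_test and y_pred_proba, so the numbers it prints say nothing about a real model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: scores, and outcomes that become likelier as the score rises.
y_pred_proba = rng.random(n)
y_test = (rng.random(n) < y_pred_proba * 0.3).astype(int)

results = pd.DataFrame({"score": y_pred_proba, "actual": y_test})

# Rank 1 is the highest score, so decile 1 holds the top-scored 10% of customers.
ranks = results["score"].rank(method="first", ascending=False)
results["decile"] = pd.qcut(ranks, 10, labels=list(range(1, 11)))

lift_table = results.groupby("decile", observed=True)["actual"].agg(
    customers="count", conversion_rate="mean"
)

# Lift above 1 means the decile converts better than the average customer.
lift_table["lift"] = lift_table["conversion_rate"] / results["actual"].mean()

print(lift_table)
```

A useful model shows lift well above 1 in the top deciles and below 1 in the bottom ones.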

If the model’s performance is not satisfactory, consider experimenting with different algorithms, adjusting hyperparameters, or adding new features to improve its predictive power.
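
As one way to adjust hyperparameters, here is a minimal grid search sketch scored by cross-validated ROC-AUC. The synthetic dataset and the small parameter grid are assumptions for illustration; a real search would use your own features and a grid suited to your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)

# A deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # same metric used in the evaluation step
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.3f}")
```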

Step 4: Deploying the Propensity Model

Once you’ve built and evaluated your propensity model, you can use it to predict the likelihood of customers taking specific actions. For example, you can apply the model to your customer base to predict which customers are most likely to make a purchase, and then target them with personalized marketing campaigns.

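# Make predictions for a new customer
# Columns must match the training features in name and order, and the encoded
# values must come from the encoders fitted on the training data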
new_customer_data = pd.DataFrame({"hashed_gender_encoded": [1],
                                  "hashed_country_encoded": [45],
                                  "hashed_source_encoded": [3],
                                  "recency": [15]})

purchase_probability = clf.predict_proba(new_customer_data)[:, 1]
print(f"Purchase Probability: {purchase_probability[0]}")
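
In practice you will usually score the whole customer base and pick the highest-scoring customers for a campaign. The sketch below trains on synthetic data so that it runs on its own; the helper `make_customers`, the synthetic target, and the cutoff of 20 customers are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_columns = [
    "hashed_gender_encoded", "hashed_country_encoded",
    "hashed_source_encoded", "recency",
]
rng = np.random.default_rng(1)


def make_customers(n: int) -> pd.DataFrame:
    """Generate synthetic customers with already-encoded features."""
    return pd.DataFrame({
        "hashed_gender_encoded": rng.integers(0, 2, n),
        "hashed_country_encoded": rng.integers(0, 50, n),
        "hashed_source_encoded": rng.integers(0, 5, n),
        "recency": rng.integers(0, 365, n),
    })


# Synthetic training data: recently active customers buy more often.
train = make_customers(500)
buy_rate = np.where(train["recency"] < 30, 0.6, 0.05)
target = (rng.random(500) < buy_rate).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(train[feature_columns], target)

# Score the customer base and keep the 20 most likely buyers.
customers = make_customers(200)
customers["purchase_probability"] = clf.predict_proba(customers[feature_columns])[:, 1]
top_customers = customers.sort_values("purchase_probability", ascending=False).head(20)

print(top_customers)
```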

Conclusion

In this blog post, we’ve demonstrated how to build a propensity model using hashed data in Python. By following these steps, you can use hashed data to predict customer behavior and focus your marketing on the customers most likely to respond. As a next step, consider exploring feature selection, hyperparameter tuning, or other ensemble methods such as gradient boosting to further refine your propensity model.