Allstate Purchase Prediction Challenge

1. Problem Statement

Imagine that you're a consultant hired by Allstate, one of the nation's largest insurance providers, to help the company understand its customers' purchase behavior. The insurance landscape is becoming increasingly competitive, and customers have more choices than ever before. Allstate wants to use data science to strengthen customer relationships, tailor product offerings, and ultimately increase sales. This is difficult because a customer may request several quotes, changing coverage options along the way, before settling on a policy.

You are tasked with analyzing customer data – demographics, past interactions, policy quotes, and purchase history – to build a model that predicts which coverage options customers are most likely to select. By identifying the factors that drive customer choices, you will be able to provide Allstate with actionable insights for personalizing marketing messages, streamlining the quoting process, and developing new product bundles that better meet customer needs. Your success will translate into increased customer satisfaction, improved retention rates, and a significant boost to Allstate's overall sales performance.

Goal: The goal of this project is to build a model that can predict the purchased coverage options using a customer's shopping history.

2. Data Description

The dataset contains the transaction history of customers who ended up purchasing a policy. It includes customer information, the coverage options quoted at each shopping point, and the quoted cost. In the training set, each customer's entire quote history is included, and the last row (record_type=1) holds the purchased coverage options. The test set contains only a partial quote history for each customer, and the task is to predict the coverage options that customer goes on to purchase.

Rows: 665250
Columns: 25
Variables:
- customer_ID: A unique identifier for the customer
- shopping_pt: Unique identifier for the shopping point of a given customer
- record_type: 0=shopping point, 1=purchase point
- day: Day of the week (0-6, 0=Monday)
- time: Time of day (HH:MM)
- state: State where shopping point occurred
- location: Location ID where shopping point occurred
- group_size: How many people will be covered under the policy (1, 2, 3 or 4)
- homeowner: Whether the customer owns a home or not (0=no, 1=yes)
- car_age: Age of the customer's car
- car_value: How valuable the customer's car was when new
- risk_factor: An ordinal assessment of how risky the customer is (1, 2, 3, 4)
- age_oldest: Age of the oldest person in customer's group
- age_youngest: Age of the youngest person in customer's group
- married_couple: Does the customer group contain a married couple (0=no, 1=yes)
- C_previous: What the customer formerly had or currently has for product option C (0=nothing, 1, 2, 3, 4)
- duration_previous: How long (in years) the customer was covered by their previous issuer
- A, B, C, D, E, F, G: The coverage options
- cost: Cost of the quoted coverage options

Data Source: Kaggle Allstate Purchase Prediction Challenge

3. Your Task

Your task is to predict the seven coverage options (A, B, C, D, E, F, G) that each customer will end up purchasing.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert categorical variables to numerical format.
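The preprocessing steps above can be sketched as follows. This is a minimal illustration on a made-up four-row sample, not the real file: the helper name `preprocess`, the sentinel value -1, the `_missing` flag columns, and the choice of which columns contain gaps are all assumptions to check against the actual data.

```python
import io
import pandas as pd

# Tiny synthetic sample using a subset of the columns described above.
# All values are made up for illustration.
SAMPLE_CSV = """customer_ID,shopping_pt,record_type,day,time,state,car_value,risk_factor,C_previous,duration_previous,A,cost
1,1,0,0,08:35,IN,g,3,1,2,1,633
1,2,1,0,08:40,IN,g,3,1,2,1,630
2,1,0,3,14:05,NY,e,,,,0,710
2,2,1,3,14:20,NY,e,,,,2,705
"""

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: numeric encoding plus simple missing-value handling."""
    out = df.copy()

    # Convert "HH:MM" into minutes since midnight.
    hhmm = out["time"].str.split(":", expand=True).astype(int)
    out["time_minutes"] = hhmm[0] * 60 + hhmm[1]
    out = out.drop(columns="time")

    # Assumed columns with gaps: flag the gap, then fill with a sentinel (-1).
    for col in ["risk_factor", "C_previous", "duration_previous"]:
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(-1).astype(int)

    # Encode string categoricals as integer codes (sorted category order).
    for col in ["state", "car_value"]:
        out[col] = out[col].astype("category").cat.codes

    return out

raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
clean = preprocess(raw)
print(clean.isna().sum().sum())  # no missing values remain
print(clean.dtypes)
```

Integer codes are adequate for tree-based models; for linear models, one-hot encoding (for example `pd.get_dummies`) is usually the safer choice.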
Feature Engineering (Crucial):
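- Aggregate features from the shopping history into one row per customer (e.g., number of quotes, cost statistics, most recently quoted options).
- Create lag features, such as the option value at the previous shopping point and whether it changed, to capture the customer's decision-making process.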
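One possible way to build such features with pandas is shown below. It is a sketch on toy data: only options A and B are included to keep it short, and the helper `build_features` and all derived column names (`n_quotes`, `A_last`, `A_n_changes`, and so on) are hypothetical.

```python
import pandas as pd

OPTIONS = ["A", "B"]  # toy subset; the real data has A through G

# Made-up quote history for two customers.
quotes = pd.DataFrame({
    "customer_ID": [1, 1, 1, 2, 2],
    "shopping_pt": [1, 2, 3, 1, 2],
    "record_type": [0, 0, 1, 0, 1],
    "A": [0, 1, 1, 2, 2],
    "B": [0, 0, 1, 1, 1],
    "cost": [600, 640, 650, 700, 700],
})

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one feature row per customer from shopping rows."""
    df = df.sort_values(["customer_ID", "shopping_pt"])
    # Keep only shopping rows so the purchase row never leaks into the features.
    shop = df[df["record_type"] == 0].copy()

    # Lag features: value at the previous shopping point, and whether it changed.
    grouped = shop.groupby("customer_ID")
    for opt in OPTIONS:
        prev = grouped[opt].shift(1)
        shop[f"{opt}_prev"] = prev
        shop[f"{opt}_changed"] = prev.notna() & (shop[opt] != prev)

    # Aggregate the history down to one row per customer.
    return shop.groupby("customer_ID").agg(
        n_quotes=("shopping_pt", "max"),
        cost_mean=("cost", "mean"),
        cost_last=("cost", "last"),
        **{f"{o}_last": (o, "last") for o in OPTIONS},
        **{f"{o}_n_changes": (f"{o}_changed", "sum") for o in OPTIONS},
    )

features = build_features(quotes)
# Targets: the options on the purchase row of each customer.
targets = quotes[quotes["record_type"] == 1].set_index("customer_ID")[OPTIONS]
print(features)
```

Because the test set holds only a partial history, it may be worth truncating training histories in a similar way before computing features, so that training and test features are comparable.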
Model Building:
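- Split the data into training and validation sets. Split by customer_ID (group-based splitting) so that no customer's quotes appear in both sets.
- Choose a classifier that can predict several outputs at once (e.g., Random Forest, Gradient Boosting). Each of the seven options takes one of several discrete values, so this is a multi-output multiclass problem rather than a binary multi-label one.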
- Train the model on the training data.
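A minimal sketch of these steps with scikit-learn follows. The data is random and purely illustrative (50 synthetic customers with 4 rows each), and the hyperparameters are placeholders. `RandomForestClassifier` accepts a two-dimensional target directly, which is why it is used here; an estimator without that support could be wrapped in `sklearn.multioutput.MultiOutputClassifier` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 customers, 4 rows each, 7 features.
groups = np.repeat(np.arange(50), 4)          # plays the role of customer_ID
X = rng.integers(0, 4, size=(len(groups), 7))  # e.g. last quoted A..G

# Synthetic targets: usually equal to the features, with about 10% random changes.
Y = X.copy()
flip = rng.random(size=Y.shape) < 0.1
Y[flip] = rng.integers(0, 4, size=int(flip.sum()))

# Group-based split: all rows of a customer land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, Y, groups=groups))

# One forest predicting all seven outputs at once.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X[train_idx], Y[train_idx])

pred = model.predict(X[valid_idx])  # shape: (n_validation_rows, 7)
print(pred.shape)
```

With real features, `groups` would be the customer_ID column, so that a customer's quotes never appear in both the training and the validation set.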
Model Evaluation:
- Evaluate the model's performance on the validation data.
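- Use appropriate metrics for multi-output classification, such as per-option accuracy and the share of customers for whom all seven options are predicted correctly.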
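Two simple metrics that fit this setup are sketched below: the fraction of customers for whom every option is correct, and the accuracy of each option on its own. The function names are hypothetical and the arrays are made up. The original Kaggle competition, as far as I recall, scored submissions on the first of these (all-or-none accuracy); confirm this against the competition page.

```python
import numpy as np

def exact_match_accuracy(y_true, y_pred) -> float:
    """Share of rows where all options are predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def per_option_accuracy(y_true, y_pred, names="ABCDEFG") -> dict:
    """Accuracy of each option column, keyed by option name."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return dict(zip(names, np.mean(y_true == y_pred, axis=0).tolist()))

# Made-up example: four customers, seven options each.
y_true = [[1, 0, 2, 2, 0, 1, 3],
          [0, 0, 1, 1, 0, 0, 1],
          [2, 1, 3, 3, 1, 2, 2],
          [1, 1, 4, 3, 1, 3, 4]]
y_pred = [[1, 0, 2, 2, 0, 1, 3],   # all correct
          [1, 0, 1, 1, 0, 0, 1],   # A wrong
          [2, 1, 3, 3, 1, 2, 2],   # all correct
          [1, 1, 4, 3, 1, 2, 3]]   # F and G wrong

print(exact_match_accuracy(y_true, y_pred))  # 0.5
print(per_option_accuracy(y_true, y_pred))
```

The exact-match score is much stricter than the per-option scores, so the per-option breakdown is mainly useful for finding out which options the model gets wrong.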
Prediction and Submission:
- Make predictions on the test data.
- Format the submission file according to the competition guidelines.
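As an illustration of the formatting step, the sketch below assumes the submission has two columns, `customer_ID` and `plan`, where `plan` is the seven predicted option values concatenated in the order A to G. This layout is an assumption and should be checked against the competition guidelines; the helper `make_submission` and the customer IDs are made up.

```python
import io
import numpy as np
import pandas as pd

def make_submission(customer_ids, predictions) -> pd.DataFrame:
    """Hypothetical helper: join the seven option values into one plan string."""
    preds = np.asarray(predictions).astype(int)
    plans = ["".join(str(v) for v in row) for row in preds]
    return pd.DataFrame({"customer_ID": list(customer_ids), "plan": plans})

# Made-up predictions for two customers, columns ordered A..G.
sub = make_submission(
    [10000001, 10000002],
    [[1, 0, 2, 2, 0, 1, 3],
     [0, 0, 1, 1, 0, 0, 1]],
)

# Write to an in-memory buffer here; use a file path for a real submission.
buffer = io.StringIO()
sub.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Keeping `plan` as a string matters: a plan that starts with 0 would lose its leading zero if it were stored as an integer.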

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.