Google Analytics Customer Revenue Prediction

1. Problem Statement

Imagine that you are a data scientist for a big marketing firm, hired to improve your client's sales by developing a powerful predictive model. One of your largest clients is the Google Merchandise Store, where they sell Google swag and various digital devices. The store wants to better understand where they are earning (and not earning) their revenue and what factors drive the highest revenue, but their current analytical methods are too high-level to provide actionable insights. The CEO of the store has high hopes for your abilities.

You are given a huge dataset derived from Google Analytics, filled with customer interactions and behavior. Your primary task is to build a model that can accurately predict revenue per customer, identify key drivers of revenue, and pinpoint customer segments with high potential. These efforts will allow you to provide actionable recommendations to the marketing and product teams. It is important for the company to understand their customer interactions and optimize their strategies.

Goal: The goal of this project is to build a model that can accurately predict revenue per customer.

2. Data Description

The dataset contains customer interactions and transactions from the Google Merchandise Store. It includes information about visits, devices, geography, traffic sources, and more.

Data Source: Kaggle GA Customer Revenue Prediction

Download Data

3. Your Task

Your task is to predict the natural log of the sum of all transactions per user.

  1. Data Exploration and Preprocessing:
    • Load the dataset using Pandas.
    • Explore the data to understand the distribution of features.
    • Handle missing values, if any.
    • Convert JSON blobs and other complex data types to usable formats.
  2. Feature Engineering (Crucial):
    • Extract meaningful features from JSON columns.
    • Create aggregate features.
  3. Model Building:
    • Split the data into training and validation sets.
    • Choose a suitable regression algorithm (e.g., XGBoost, LightGBM).
    • Train the model on the training data.
  4. Model Evaluation:
    • Evaluate the model's performance on the validation data.
    • Use metrics such as RMSE.
  5. Prediction and Submission:
    • Make predictions on the test data.

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, json, XGBoost/LightGBM.