Google Analytics Customer Revenue Prediction

1. Problem Statement

Imagine that you are a data scientist for a big marketing firm, hired to improve your client's sales by developing a powerful predictive model. One of your largest clients is the Google Merchandise Store, which sells Google swag and various digital devices. The store wants to better understand where it earns (and fails to earn) its revenue and which factors drive the highest revenue, but its current analytical methods are too high-level to provide actionable insights. The CEO of the store has high hopes for your abilities.

You are given a large dataset derived from Google Analytics that records customer interactions and behavior. Your primary task is to build a model that accurately predicts revenue per customer, identifies the key drivers of revenue, and pinpoints customer segments with high potential. These results will let you give actionable recommendations to the marketing and product teams, helping the company understand its customer interactions and optimize its strategies.

Goal: Build a model that accurately predicts revenue per customer.

2. Data Description

The dataset contains customer interactions and transactions from the Google Merchandise Store. It includes information about visits, devices, geography, traffic sources, and more.

Rows: 903654
Columns: 12
Variables:
- fullVisitorId: Unique identifier for each user
- channelGrouping: Channel via which the user came to the Store
- date: The date on which the user visited the Store
- device: Specifications for the device used to access the Store
- geoNetwork: Information about the geography of the user
- socialEngagementType: Engagement type, either "Socially Engaged" or "Not Socially Engaged"
- totals: Aggregate values across the session
- trafficSource: Information about the Traffic Source from which the session originated
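- visitId: Identifier for this session; unique only within a user, so combine it with fullVisitorId for a fully unique session ID
- visitNumber: The session number for this user (1 for the user's first session)
- visitStartTime: The timestamp of the start of the visit, expressed as POSIX time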
- hits: Row and nested fields for all types of hits
- customDimensions: User-level or session-level custom dimensions

Data Source: Kaggle GA Customer Revenue Prediction

3. Your Task

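Your task is to predict, for each user, the natural log of the sum of all of that user's transactions. Because most users have no transactions and the log of zero is undefined, the target is computed as ln(sum + 1), which maps users with no revenue to 0.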
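
The target can be sketched in a few lines. This is only an illustration on toy data: it assumes session-level revenue has already been extracted into a hypothetical `revenue` column (missing when nothing was bought), and it uses the ln(sum + 1) form so that users without purchases get a target of exactly 0.

```python
import numpy as np
import pandas as pd

# Toy session-level data; `revenue` is a hypothetical column name for
# the per-session transaction revenue (NaN when nothing was bought).
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "revenue": [10.0, 5.0, np.nan, 20.0],
})

# Treat missing revenue as zero, then sum per user.
user_revenue = (
    sessions.assign(revenue=sessions["revenue"].fillna(0.0))
    .groupby("fullVisitorId")["revenue"]
    .sum()
)

# log1p(x) = ln(1 + x), so users with zero revenue map to exactly 0.
target = np.log1p(user_revenue)
```

Note that the log is applied after summing per user, not per session.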

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert JSON blobs and other complex data types to usable formats.
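
One possible way to load the data and flatten the JSON columns is sketched below. It reads a tiny in-memory stand-in rather than the real file, and the JSON keys shown (`browser`, `pageviews`, `transactionRevenue`) are assumptions made for illustration; the real columns contain more keys.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-in for the real CSV file; the JSON keys used
# here are illustrative assumptions.
raw_csv = io.StringIO(
    'fullVisitorId,device,totals\n'
    '0001,"{""browser"": ""Chrome""}","{""pageviews"": ""3""}"\n'
    '0002,"{""browser"": ""Safari""}",'
    '"{""pageviews"": ""1"", ""transactionRevenue"": ""1500""}"\n'
)

JSON_COLUMNS = ["device", "totals"]

# Parse each JSON column while reading, and keep the visitor ID as a
# string so that leading zeros and very long IDs are not mangled.
df = pd.read_csv(
    raw_csv,
    converters={col: json.loads for col in JSON_COLUMNS},
    dtype={"fullVisitorId": str},
)

# Expand each column of dicts into flat "column.key" columns.
for col in JSON_COLUMNS:
    flat = pd.json_normalize(df[col].tolist())
    flat.columns = [f"{col}.{key}" for key in flat.columns]
    flat.index = df.index
    df = df.drop(columns=col).join(flat)

# Values inside the JSON arrive as strings; convert numeric ones
# explicitly and treat a missing revenue as zero.
df["totals.pageviews"] = pd.to_numeric(df["totals.pageviews"])
df["totals.transactionRevenue"] = pd.to_numeric(
    df["totals.transactionRevenue"]
).fillna(0)
```

Keys that are absent from a row's JSON simply become missing values after flattening, which is why the missing-value step comes after the JSON step here.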
Feature Engineering (Crucial):
- Extract meaningful features from JSON columns.
- Create aggregate features at the user level, since each row describes one visit but the target is defined per user.
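
A minimal sketch of user-level aggregation follows. The flattened column names (`pageviews`) and the chosen statistics are assumptions for illustration, not a prescribed feature set.

```python
import pandas as pd

# Toy flattened session table; column names are illustrative assumptions.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "visitNumber": [1, 2, 1],
    "pageviews": [3, 7, 1],
})

# Collapse sessions to one row per user with named aggregation.
user_features = sessions.groupby("fullVisitorId").agg(
    n_sessions=("visitNumber", "size"),
    max_visit_number=("visitNumber", "max"),
    total_pageviews=("pageviews", "sum"),
    mean_pageviews=("pageviews", "mean"),
)
```

The result is indexed by `fullVisitorId`, so it can be joined directly to a per-user target.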
Model Building:
- Split the data into training and validation sets.
- Choose a suitable regression algorithm (e.g., XGBoost, LightGBM).
- Train the model on the training data.
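
The split-and-train steps might look like the sketch below. To stay within scikit-learn it uses `GradientBoostingRegressor` as a stand-in for XGBoost or LightGBM, and the data is synthetic; both are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic user-level data standing in for the engineered features:
# a non-negative target that depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.maximum(0.0, 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500))

# Hold out 20% of the users for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-in gradient boosting model; hyperparameters are not tuned.
model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

valid_pred = model.predict(X_valid)
```

Splitting after aggregating to one row per user keeps all of a user's sessions on the same side of the split.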
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as RMSE.
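
RMSE is the square root of the mean squared difference between actual and predicted values. A small sketch with made-up numbers, including a constant-zero baseline for comparison (the helper name `rmse` is our own):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up validation targets and predictions.
y_valid = [0.0, 0.0, 3.0]
y_pred = [0.0, 1.0, 1.0]

score = rmse(y_valid, y_pred)               # sqrt((0 + 1 + 4) / 3)
baseline = rmse(y_valid, np.zeros(3))       # always predict zero
```

A model is only useful if its RMSE is lower than that of such a trivial baseline.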
Prediction and Submission:
- Make predictions on the test data.
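
A sketch of turning raw test predictions into a submission table is shown below. The output column name `PredictedLogRevenue` is an assumption based on the Kaggle competition format, the predictions are made up, and clipping at zero assumes the target was built as ln(sum + 1), which cannot be negative.

```python
import numpy as np
import pandas as pd

# Hypothetical raw model outputs for three test users.
test_ids = pd.Series(["0001", "0002", "0003"], name="fullVisitorId")
raw_pred = np.array([-0.2, 0.0, 1.7])

# ln(1 + revenue) cannot be negative, so clip predictions at zero.
submission = pd.DataFrame({
    "fullVisitorId": test_ids,
    "PredictedLogRevenue": np.clip(raw_pred, 0.0, None),
})

# Serialize without the index; pass a file path to to_csv to write a file.
csv_text = submission.to_csv(index=False)
```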

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, json, XGBoost/LightGBM.