Regression with a Tabular Media Campaign Cost Dataset

1. Problem Statement

As a marketing analyst at a successful advertising agency, you've observed inconsistencies in media campaign costs across different projects. Some campaigns unexpectedly exceed their budgets, while others underperform in terms of reach and engagement. The current process for estimating campaign costs is largely based on industry averages and doesn't account for the unique characteristics of each client or target audience. The agency needs a more precise and data-driven approach to budget allocation.

You have access to a dataset of past media campaigns that covers reach, demographics, channels, and expenditure. Your task is to build a regression model that predicts campaign costs from these factors. By identifying the key cost drivers and providing a reliable forecasting tool, you will help the agency allocate budgets more efficiently, negotiate better rates with media outlets, and maximize the return on investment of its clients' marketing campaigns.

Goal: Build a regression model that accurately predicts the cost of media campaigns.

2. Data Description

The dataset describes past media campaigns through store and unit sales figures, household attributes (children, cars), product attributes, store amenities, and the campaign cost, which is the target variable.

Rows: 360337
Columns: 17
Variables:
- id: Unique identifier
- store_sales: Store sales (in millions)
- unit_sales: Unit sales (in millions)
- total_children: Total children
- num_children_at_home: Number of children at home
- avg_cars_at home: Average cars at home
- gross_weight: Gross weight
- recyclable_package: Recyclable package
- low_fat: Low fat
- units_per_case: Units per case
- store_sqft: Store square footage
- coffee_bar: Coffee bar
- video_store: Video store
- salad_bar: Salad bar
- prepared_food: Prepared food
- florist: Florist
- cost: Cost (target variable)
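
A quick schema check can catch column mismatches before any modeling starts. The following is a minimal sketch that assumes the column names match the list above exactly; the headers in the raw file may differ (for example in spacing or suffixes), so EXPECTED_COLUMNS and the check_schema helper are illustrative rather than definitive.

```python
import pandas as pd

# Column names copied from the variable list above; the raw file headers may
# differ slightly, so treat this list as an assumption to verify.
EXPECTED_COLUMNS = [
    "id", "store_sales", "unit_sales", "total_children",
    "num_children_at_home", "avg_cars_at home", "gross_weight",
    "recyclable_package", "low_fat", "units_per_case", "store_sqft",
    "coffee_bar", "video_store", "salad_bar", "prepared_food",
    "florist", "cost",
]


def check_schema(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (missing, extra) column names relative to EXPECTED_COLUMNS."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    extra = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    return missing, extra


# Demo on an empty frame where "cost" was replaced by a made-up "price" column.
demo = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["price"])
missing, extra = check_schema(demo)
print("missing:", missing, "extra:", extra)
```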

Data Source: Kaggle Playground Series S3E11

3. Your Task

Build a regression model that predicts media campaign costs (the cost column) by working through the steps below.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Scale numerical features so that features with large ranges (e.g., store_sqft) do not dominate scale-sensitive models such as linear regression.
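
The preprocessing steps above can be sketched as follows. This is a minimal example on synthetic data, not the real dataset: the values are randomly generated, only a few of the listed columns are used, and median imputation is just one reasonable choice for filling gaps. In practice the frame would come from something like pd.read_csv("train.csv"), where the filename is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "store_sales": rng.uniform(1, 20, 200),
    "unit_sales": rng.integers(1, 7, 200).astype(float),
    "store_sqft": rng.uniform(20000, 40000, 200),
    "coffee_bar": rng.integers(0, 2, 200),
    "cost": rng.uniform(50, 150, 200),
})
df.loc[[3, 17], "store_sales"] = np.nan  # inject gaps to exercise imputation

# Explore: summary statistics and missing-value counts per column.
print(df.describe().T)
missing_counts = df.isna().sum()
print(missing_counts)

# Handle missing values: fill numeric gaps with the column median.
numeric_cols = ["store_sales", "unit_sales", "store_sqft"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Scale numeric features to zero mean and unit variance.
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```

In a full pipeline the scaler would normally be fit on the training split only and then applied to the validation and test data, to avoid leaking information across splits.
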
Feature Engineering (Optional):
- Create new features that may improve model performance.
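
As an illustration of feature engineering, the sketch below derives two features from the listed columns. Both sales_per_unit and amenity_count are hypothetical features chosen for this example; whether they help is something to verify on the validation set.

```python
import numpy as np
import pandas as pd

AMENITY_COLS = ["coffee_bar", "video_store", "salad_bar", "prepared_food", "florist"]


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative engineered features."""
    out = df.copy()
    # Average sales value per unit; zero unit_sales becomes NaN instead of inf.
    out["sales_per_unit"] = out["store_sales"] / out["unit_sales"].replace(0, np.nan)
    # Number of amenities the store offers (sum of the 0/1 indicator columns).
    out["amenity_count"] = out[AMENITY_COLS].sum(axis=1)
    return out


# Two made-up rows to show the transformation.
sample = pd.DataFrame({
    "store_sales": [8.0, 5.0],
    "unit_sales": [2.0, 2.0],
    "coffee_bar": [1, 0],
    "video_store": [0, 0],
    "salad_bar": [1, 0],
    "prepared_food": [1, 0],
    "florist": [0, 0],
})
features = add_features(sample)
print(features[["sales_per_unit", "amenity_count"]])
```
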
Model Building:
- Split the data into training and validation sets.
- Choose a suitable regression algorithm (e.g., Linear Regression, Random Forest, Gradient Boosting).
- Train the model on the training data.
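
The model-building steps above might look like the following. This is a sketch on synthetic data generated with make_regression; the 80/20 split, the random forest, and its hyperparameters are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the campaign data.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a random forest on the training split only.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# score() reports R-squared on the held-out validation split.
print("validation R^2:", model.score(X_val, y_val))
```
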
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as root mean squared error (RMSE), mean absolute error (MAE), and R-squared.
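
The three metrics can be computed with scikit-learn as sketched below. The true and predicted values are made-up numbers chosen so the results are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up validation targets and predictions.
y_true = np.array([100.0, 120.0, 80.0, 95.0])
y_pred = np.array([110.0, 115.0, 70.0, 95.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # 7.5
mae = mean_absolute_error(y_true, y_pred)           # 6.25
r2 = r2_score(y_true, y_pred)                       # about 0.725

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a rough sense of whether a few campaigns are predicted much worse than the rest.
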
Prediction and Submission:
- Make predictions on the test data.
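
A sketch of the prediction step follows. It assumes that the test data has the same feature columns as the training data plus an id column, and that a submission consists of id and predicted cost; both are assumptions to check against the competition instructions. The data here is synthetic and the linear model is only a placeholder for whichever model was trained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
feature_cols = ["store_sales", "unit_sales", "store_sqft"]

# Synthetic training data with a known relationship to cost.
train = pd.DataFrame(rng.uniform(0, 1, (100, 3)), columns=feature_cols)
train["cost"] = 50 + 30 * train["store_sales"] + rng.normal(0, 1, 100)

# Synthetic test data: same features, an id column, and no cost column.
test = pd.DataFrame(rng.uniform(0, 1, (10, 3)), columns=feature_cols)
test.insert(0, "id", np.arange(1000, 1010))

model = LinearRegression().fit(train[feature_cols], train["cost"])

# Assumed submission layout: one row per test id with the predicted cost.
submission = pd.DataFrame({
    "id": test["id"],
    "cost": model.predict(test[feature_cols]),
})
print(submission.head())
# submission.to_csv("submission.csv", index=False)  # hypothetical filename
```
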

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.