Porto Seguro's Safe Driver Prediction

1. Problem Statement

As a highly valued data scientist at Porto Seguro, one of Brazil's largest auto and homeowner insurance companies, you recognize the critical impact of accurate risk assessment on the company's financial stability and customer satisfaction. The ability to differentiate between safe and risky drivers is paramount for setting appropriate premiums, minimizing claim payouts, and maintaining a healthy bottom line. Inaccurate risk models can lead to significant losses, alienate safe drivers with unfairly high rates, and attract risky drivers with artificially low premiums, ultimately jeopardizing the company's long-term sustainability. The CEO of Porto Seguro considers this problem critical and has hired you to solve it.

Therefore, your mission is to use your data science prowess to develop a model that accurately predicts the probability that a driver will file an insurance claim in the coming year. This requires careful analysis of a rich dataset that encompasses a variety of driver characteristics, policy details, and historical claim data. By identifying the key factors that contribute to safe driving behavior, you can create a robust prediction tool that enables Porto Seguro to offer more competitive rates to its safest customers, reduce the financial burden on conscientious drivers, and solidify the company's reputation as a fair and responsible insurer.

Goal: The goal of this project is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.

2. Data Description

The dataset contains anonymized policy and claim information. Features that belong to the same grouping share a tag in their names (e.g., `ind`, `reg`, `car`, `calc`). Names containing `bin` mark binary features, and names containing `cat` mark categorical features.

Rows: 595213
Columns: 59
Variables:
- Feature groups are tagged in the feature names (e.g., `ind`, `reg`, `car`, `calc`).
- Features with `_bin` in the name are binary.
- Features with `cat` in the name are categorical.
- A value of -1 indicates that the feature was missing from the observation.
- Target: the `target` column indicates whether a claim was filed.
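
The naming convention above can be turned into a small helper for sorting columns by group and type. This is only a sketch: it assumes column names look like `ps_car_01_cat` or `ps_ind_06_bin` (an assumption about the file, not something stated here), and the function name `parse_feature_name` and the label `"other"` are made up for illustration.

```python
KNOWN_GROUPS = {"ind", "reg", "car", "calc"}

def parse_feature_name(name):
    """Return (group, kind) for a column name such as 'ps_car_01_cat'.

    group is one of the tags in KNOWN_GROUPS, or None if no tag is present.
    kind is 'binary', 'categorical', or 'other' (anything without a suffix).
    """
    parts = name.split("_")
    # First name fragment that matches a known group tag, if any.
    group = next((p for p in parts if p in KNOWN_GROUPS), None)
    if name.endswith("_bin"):
        kind = "binary"
    elif name.endswith("_cat"):
        kind = "categorical"
    else:
        kind = "other"
    return group, kind

print(parse_feature_name("ps_car_01_cat"))
```

Columns such as an identifier or `target` carry no group tag, so the helper returns `None` for their group.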

Data Source: Kaggle Porto Seguro Safe Driver Prediction


3. Your Task

Your task is to build a classification model to predict the probability that an auto insurance policy holder files a claim.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert categorical variables to numerical format.
- Keep in mind that many variables are anonymized, so their meaning cannot be read from their names.
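
One possible way to carry out the preprocessing steps above is sketched below. It assumes that -1 marks a missing value (as described in the data section), that categorical columns end in `_cat`, and that median imputation is acceptable for the remaining columns; the imputation choice and the function name `preprocess` are illustrative assumptions, not requirements of the task.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Mark -1 as missing, impute non-categorical columns, one-hot encode *_cat."""
    df = df.replace(-1, np.nan)
    cat_cols = [c for c in df.columns if c.endswith("_cat")]
    other_cols = [c for c in df.columns if c not in cat_cols]
    # Assumption: fill missing non-categorical values with the column median.
    df[other_cols] = df[other_cols].fillna(df[other_cols].median())
    # dummy_na=True keeps "missing" as its own category instead of dropping it.
    return pd.get_dummies(df, columns=cat_cols, dummy_na=True)

# Tiny made-up frame standing in for the real data, e.g. pd.read_csv("train.csv")
# (the file name is hypothetical).
sample = pd.DataFrame({
    "ps_ind_01": [1, -1, 3],
    "ps_car_01_cat": [0, 1, -1],
    "ps_ind_06_bin": [0, 1, 0],
})
clean = preprocess(sample)
print(clean.head())
```

Tree-based models can often work with the -1 marker directly, so this step may be simplified depending on the algorithm chosen.
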
Feature Engineering (Optional):
- Create new features that may improve model performance.
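
As one example of a new feature, the number of missing values per row can be counted before any imputation. Whether this helps is an empirical question; the sketch assumes -1 is still the missing marker, and the column name `n_missing` is invented for illustration.

```python
import pandas as pd

def add_missing_count(df, missing_marker=-1):
    """Return a copy of df with an extra column counting missing values per row."""
    out = df.copy()
    # Compare every cell to the marker, then count matches across each row.
    out["n_missing"] = (df == missing_marker).sum(axis=1)
    return out

sample = pd.DataFrame({"ps_ind_01": [1, -1], "ps_car_01_cat": [-1, -1]})
with_count = add_missing_count(sample)
print(with_count)
```
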
Model Building:
- Split the data into training and validation sets.
- Choose a suitable classification algorithm (e.g., Logistic Regression, Random Forest, Gradient Boosting).
- Train the model on the training data.
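
The model-building steps might look like the following sketch. It uses synthetic, imbalanced data from scikit-learn's `make_classification` as a stand-in for the real features, and Logistic Regression only as one of the algorithms suggested above. The split ratio, random seed, and sample size are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

# stratify=y keeps the share of positive cases similar in both splits,
# which matters when the positive class is rare.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of the positive class.
val_proba = model.predict_proba(X_val)[:, 1]
```
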
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as precision, recall, F1-score, and AUC-ROC alongside accuracy. Because the dataset is imbalanced, accuracy alone can be misleading: a model that predicts "no claim" for every driver would still score high accuracy.
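
The effect of imbalance on the metrics can be seen with a small made-up example: 95 drivers without a claim, 5 with one, and a model that gives everyone a score of zero. The helper name `evaluate` and the 0.5 threshold are assumptions for illustration. The normalized Gini coefficient, the metric used by the original Kaggle competition, equals 2 * AUC - 1 for a binary target and is included for reference.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

def evaluate(y_true, scores, threshold=0.5):
    """Compute threshold-based and ranking-based metrics for predicted scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    labels = (scores >= threshold).astype(int)  # hard predictions
    auc = roc_auc_score(y_true, scores)
    return {
        "accuracy": accuracy_score(y_true, labels),
        "recall": recall_score(y_true, labels),
        "auc": auc,
        "gini": 2 * auc - 1,  # normalized Gini for a binary target
    }

y_true = [0] * 95 + [1] * 5
report = evaluate(y_true, np.zeros(100))  # model that never predicts a claim
print(report)
```

Here accuracy comes out at 0.95 while recall is 0 and AUC is 0.5, so the high accuracy says nothing about the model's ability to find risky drivers.
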
Prediction and Submission:
- Make predictions on the test data.
- Format the submission file according to the competition guidelines.
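
A submission file could be assembled as in the sketch below. It assumes the expected format is two columns named `id` and `target`, with `target` holding the predicted claim probability; this should be checked against the competition's sample submission. The function name `make_submission` and the output path are hypothetical.

```python
import pandas as pd

def make_submission(ids, probabilities, path=None):
    """Build a two-column submission frame and optionally write it to CSV."""
    submission = pd.DataFrame({"id": ids, "target": probabilities})
    if path is not None:
        # index=False avoids writing the DataFrame index as an extra column.
        submission.to_csv(path, index=False)
    return submission

# Made-up ids and probabilities; pass path="submission.csv" to write a file.
submission = make_submission([0, 1, 2], [0.03, 0.10, 0.01])
print(submission)
```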

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.