Regression of Used Car Prices

1. Problem Statement

You're a data scientist at a company that buys and sells used cars, operating in a highly competitive market where accurate pricing is crucial for success. The company needs a reliable and data-driven way to determine the fair market value of used vehicles, but the current process relies on manual appraisals and subjective assessments, leading to inconsistencies and missed opportunities. As a result, the company may be overpaying for inventory or pricing cars too high, leading to lost sales. You have been tasked with creating a more efficient and accurate pricing strategy.

You have been given a dataset that includes features such as brand, model, year, mileage, fuel type, and condition of the car, as well as the sale price. Your goal is to build a regression model that can accurately predict the sale price of used cars. This predictive tool will enable the company to price cars competitively, optimize inventory management, and maximize profit margins. By automating the pricing process and reducing reliance on manual appraisals, you'll contribute to increased efficiency and profitability for the business.

Goal: Build a regression model that accurately predicts the sale price of a used car from its listed attributes.

2. Data Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Used Car Price Prediction Dataset.

Rows: 188534
Columns: 13
Variables:
- id: Unique identifier
- brand: Brand of the car
- model: Model of the car
- model_year: Model year
- mileage: Mileage
- fuel_type: Fuel type
- engine: Engine
- transmission: Transmission type
- ext_col: Exterior color
- int_col: Interior color
- accident: Accident history
- clean_title: Clean title status
- price: Price (target variable)

Data Source: Kaggle Playground Series S4E9


3. Your Task

Your task is to build a regression model to predict used car prices.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert categorical variables to numerical format using techniques like one-hot encoding.
- Scale numerical features so that features with large ranges (e.g., mileage) do not dominate scale-sensitive models.
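
The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: the small DataFrame stands in for the real file (the name "train.csv" is an assumption), only a few of the columns from Section 2 are used, and filling missing categories with "Unknown" and missing numbers with the median is one choice among several. Column names follow the list in Section 2 and should be checked against the actual CSV header.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data; in practice something like:
#   df = pd.read_csv("train.csv")   # file name is an assumption
df = pd.DataFrame({
    "brand": ["Ford", "BMW", "Ford", "Toyota"],
    "fuel_type": ["Gasoline", np.nan, "Hybrid", "Gasoline"],
    "model_year": [2015, 2019, 2012, 2020],
    "mileage": [80000.0, 30000.0, np.nan, 15000.0],
    "price": [12000, 35000, 7000, 28000],
})

# Quick exploration: missing values per column and summary statistics.
print(df.isna().sum())
print(df.describe())

categorical_cols = ["brand", "fuel_type"]
numeric_cols = ["model_year", "mileage"]

# Treat a missing category as its own level (an assumption, not a rule).
df[categorical_cols] = df[categorical_cols].fillna("Unknown")

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    # handle_unknown="ignore" keeps unseen test-set categories from raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", numeric_pipe, numeric_cols),
])

X = preprocessor.fit_transform(df[categorical_cols + numeric_cols])
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it unchanged on validation and test data, avoids leaking information from held-out rows into the scaling statistics.
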

Feature Engineering (Optional):
- Create new features that may improve model performance.
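
As one possible illustration of derived features, the sketch below adds a car's age and its average mileage per year. The helper name add_features, the new column names, and the reference year are all hypothetical choices made for this example; the column names model_year and mileage follow the list in Section 2.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # assumption: the year the listings are evaluated in


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative derived columns."""
    out = df.copy()
    out["car_age"] = REFERENCE_YEAR - out["model_year"]
    # Clip age to at least 1 so current-year cars do not divide by zero.
    out["mileage_per_year"] = out["mileage"] / out["car_age"].clip(lower=1)
    return out


cars = pd.DataFrame({"model_year": [2014, 2024], "mileage": [100000, 5000]})
engineered = add_features(cars)
print(engineered)
```

Whether such features help depends on the model: tree ensembles can often recover them on their own, while linear models tend to benefit more.
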

Model Building:
- Split the data into training and validation sets.
- Choose a suitable regression algorithm (e.g., Linear Regression, Random Forest, Gradient Boosting).
- Train the model on the training data.
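
The model-building steps can be sketched as below. The data here is synthetic (generated only so the example runs on its own), and the choice of a random forest, the 80/20 split, and the hyperparameters are assumptions for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price falls with age and mileage, plus noise.
rng = np.random.default_rng(0)
n = 500
car_age = rng.integers(0, 20, size=n)
mileage = car_age * 10000 + rng.normal(0, 5000, size=n)
X = np.column_stack([car_age, mileage])
y = 40000 - 1500 * car_age - 0.05 * mileage + rng.normal(0, 1000, size=n)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# score() returns R-squared on the held-out rows.
print(model.score(X_val, y_val))
```

Fixing random_state makes the split and the forest reproducible, which helps when comparing algorithms against each other.
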

Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as RMSE, MAE, and R-squared.
- Visualize the results using scatter plots.
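
The metrics and the scatter plot mentioned above might be computed as in this sketch. The four actual and predicted prices are made-up numbers chosen so the arithmetic is easy to follow; in a real run they would be the validation targets and the model's predictions. RMSE is taken as the square root of the mean squared error.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([10000, 20000, 30000, 40000])
y_pred = np.array([12000, 18000, 33000, 39000])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")

# Predicted vs. actual: points on the dashed diagonal are perfect predictions.
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred, alpha=0.7)
lims = [y_true.min(), y_true.max()]
ax.plot(lims, lims, linestyle="--", color="gray")
ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Predicted vs. actual price")
# fig.savefig("pred_vs_actual.png")  # or plt.show() in an interactive session
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few badly mispriced cars dominate the error.
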

Prediction and Submission:
- Make predictions on the test data and save them to a submission file.
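
A minimal sketch of the prediction step follows. It assumes the submission consists of an id column and a price column, which should be verified against the competition's sample submission; the tiny training and test frames, the id values, and the linear model are placeholders so the example runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder data; in practice these come from the train and test files.
train = pd.DataFrame({
    "mileage": [10000, 50000, 90000],
    "price": [30000, 20000, 10000],
})
test = pd.DataFrame({"id": [101, 102], "mileage": [30000, 70000]})

model = LinearRegression()
model.fit(train[["mileage"]], train["price"])

# Apply the same feature columns (and the same preprocessing) as in training.
predictions = model.predict(test[["mileage"]])

# Assumed submission layout: one row per test id with its predicted price.
submission = pd.DataFrame({"id": test["id"], "price": predictions})
csv_text = submission.to_csv(index=False)
print(csv_text)
# To write a file instead: submission.to_csv("submission.csv", index=False)
```

The test rows must pass through exactly the transformations fitted on the training data; refitting an encoder or scaler on the test set would make the predictions inconsistent.
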

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.