New York City Taxi Fare Prediction

1. Problem Statement

Picture yourself as a data scientist at a tech company that is partnering with the New York City Taxi and Limousine Commission (TLC) to improve transportation efficiency and transparency. The current taxi fare system is somewhat opaque, and passengers are often uncertain about the final cost of a ride. The TLC aims to create a more predictable and fair pricing model that benefits both riders and drivers.

You have been given a dataset of historical taxi trips, including pickup and dropoff locations, timestamps, and passenger counts. Your task is to build a predictive model that accurately estimates taxi fares from these factors. By incorporating trip distance, time of day, and other variables derived from the data, you can create a tool that gives passengers reliable fare estimates before they enter the cab. This improved transparency can foster trust, reduce disputes, and improve the overall taxi riding experience in New York City.

Goal: Build a model that predicts the fare amount for a taxi ride in New York City from the pickup and dropoff locations and other factors.

2. Data Description

The dataset contains information about taxi rides in New York City, including pickup and dropoff locations, timestamps, and passenger count.

Data Source: Kaggle NYC Taxi Fare Prediction

3. Your Task

Your task is to build a regression model to predict taxi fare amounts.

  1. Data Exploration and Preprocessing:
    • Load the dataset using Pandas.
    • Explore the data to understand the distribution of features.
    • Handle missing values, if any.
    • Convert the pickup_datetime column to an appropriate datetime format.
  2. Feature Engineering (Crucial):
    • Calculate the distance between pickup and dropoff locations (e.g., the haversine distance between the two coordinate pairs).
    • Create features representing time-based components (e.g., hour, day of week, month).
  3. Model Building:
    • Split the data into training and validation sets.
    • Choose a suitable regression algorithm (e.g., Linear Regression, Random Forest, Gradient Boosting).
    • Train the model on the training data.
  4. Model Evaluation:
    • Evaluate the model's performance on the validation data.
    • Use metrics such as root mean squared error (RMSE) and mean absolute error (MAE).
  5. Prediction and Submission:
    • Make predictions on the test data.
    • Format the submission file according to the competition guidelines.
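
The preprocessing and feature engineering steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution. It assumes the column names of the Kaggle CSV (pickup_datetime, pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) and timestamps formatted like "2009-06-15 17:26:21 UTC", and it uses the haversine formula as one possible distance measure. The helper names haversine_km and preprocess, and the sample rows, are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def preprocess(df):
    """Drop incomplete rows, parse timestamps, add distance and time features."""
    out = df.dropna().copy()
    # Assumed timestamp layout; unparseable values become NaT and are dropped.
    out["pickup_datetime"] = pd.to_datetime(
        out["pickup_datetime"],
        format="%Y-%m-%d %H:%M:%S UTC",
        utc=True,
        errors="coerce",
    )
    out = out.dropna(subset=["pickup_datetime"])
    out["distance_km"] = haversine_km(
        out["pickup_latitude"], out["pickup_longitude"],
        out["dropoff_latitude"], out["dropoff_longitude"],
    )
    out["hour"] = out["pickup_datetime"].dt.hour
    out["day_of_week"] = out["pickup_datetime"].dt.dayofweek  # Monday = 0
    out["month"] = out["pickup_datetime"].dt.month
    return out


# Illustrative rows only (not real records); the third row has missing values.
sample = pd.DataFrame({
    "fare_amount": [6.0, 15.5, 8.0],
    "pickup_datetime": [
        "2009-06-15 17:26:21 UTC",
        "2010-01-05 16:52:16 UTC",
        "2012-04-21 04:30:42 UTC",
    ],
    "pickup_longitude": [-73.99, -74.01, -73.98],
    "pickup_latitude": [40.75, 40.71, 40.73],
    "dropoff_longitude": [-73.97, -73.98, None],
    "dropoff_latitude": [40.76, 40.78, None],
    "passenger_count": [1, 2, 1],
})

features = preprocess(sample)
print(features[["distance_km", "hour", "day_of_week", "month"]])
```

Haversine distance ignores the street grid, so it tends to underestimate the distance actually driven; it is used here only as a simple proxy. With the real data, the sample frame would be replaced by the result of loading the CSV with pandas.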

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.
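
As an illustration of how these libraries could fit together for the model building and evaluation steps, the sketch below trains two of the suggested regressors and reports RMSE and MAE on a held-out validation split. The data is synthetic, generated from a made-up fare rule so that the example runs without the Kaggle files. The feature names, the 80/20 split, and the hyperparameters are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for engineered features (NOT real taxi data).
rng = np.random.default_rng(0)
n_rows = 2000
distance_km = rng.uniform(0.5, 20.0, n_rows)
hour = rng.integers(0, 24, n_rows)
passenger_count = rng.integers(1, 7, n_rows)

# Made-up fare rule, used only to give the models something to learn.
evening_peak = (hour >= 16) & (hour < 20)
fare = 2.5 + 1.6 * distance_km + 1.0 * evening_peak + rng.normal(0.0, 1.0, n_rows)

X = np.column_stack([distance_km, hour, passenger_count])
y = fare

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, preds)))
    mae = float(mean_absolute_error(y_val, preds))
    results[name] = {"rmse": rmse, "mae": mae}
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether a model's error is dominated by a few badly mispredicted rides.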