New York City Taxi Trip Duration

1. Problem Statement

As a talented data scientist for the New York City Department of Transportation (DOT), you treat reducing traffic congestion as a top priority. Chronic traffic delays not only frustrate commuters but also harm the city's economy and air quality. You have an opportunity to use data to optimize traffic flow and improve the city's overall transportation system, and the government has pledged resources to implement your solutions.

You have access to a comprehensive dataset of taxi trip data, including pickup and dropoff locations, timestamps, and various trip attributes. Your task is to build a predictive model that accurately estimates the duration of taxi trips based on these factors. By analyzing historical patterns and incorporating real-time traffic conditions, you can provide valuable insights to the DOT for optimizing traffic signal timings, identifying congestion hotspots, and implementing targeted traffic management strategies. Your work will directly contribute to a more efficient, sustainable, and livable urban environment for millions of New Yorkers.

Goal: Build a model that predicts the total ride duration of taxi trips in New York City.

2. Data Description

The dataset contains information about taxi trips in New York City, including pickup and dropoff locations, timestamps, and passenger count. It contains real trip data that was sampled and cleaned. Based on individual trip attributes, your model should predict the duration of each trip in the test set.

Rows: 1048576
Columns: 11
Variables:
- id: a unique identifier for each trip
- vendor_id: a code indicating the provider associated with the trip record
- pickup_datetime: date and time when the meter was engaged
- dropoff_datetime: date and time when the meter was disengaged
- passenger_count: the number of passengers in the vehicle (driver-entered value)
- pickup_longitude: the longitude where the meter was engaged
- pickup_latitude: the latitude where the meter was engaged
- dropoff_longitude: the longitude where the meter was disengaged
- dropoff_latitude: the latitude where the meter was disengaged
- store_and_fwd_flag: a flag indicating whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip)
- trip_duration: duration of the trip in seconds (target variable)

Data Source: Kaggle NYC Taxi Trip Duration


3. Your Task

Your task is to build a regression model to predict the duration of taxi trips in New York City.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert the pickup and dropoff datetime columns to appropriate datetime format.
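
The loading and preprocessing steps above can be sketched as follows. This is a minimal illustration, not a prescribed solution: the helper names `load_trips` and `summarize` are hypothetical, and the three sample rows are made up so the snippet runs without the real file. With the actual data you would pass the path of the downloaded CSV to `load_trips` instead.

```python
import io

import pandas as pd

DATETIME_COLS = ["pickup_datetime", "dropoff_datetime"]


def load_trips(source):
    """Read trip records from a CSV path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    for col in DATETIME_COLS:
        if col in df.columns:
            # Unparseable or missing timestamps become NaT instead of raising.
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


def summarize(df):
    """Return the dtype and missing-value count of every column."""
    return pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "n_missing": df.isna().sum()}
    )


# Made-up rows that follow the column layout in the data description.
SAMPLE_CSV = """id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id0000001,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982,40.768,-73.965,40.766,N,455
id0000002,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980,40.739,-73.999,40.731,N,663
id0000003,2,2016-01-19 11:35:24,,1,-73.979,40.764,-74.005,40.710,N,
"""

trips = load_trips(io.StringIO(SAMPLE_CSV))
report = summarize(trips)

# One simple way to handle missing values: drop rows without a usable target.
clean = trips.dropna(subset=["dropoff_datetime", "trip_duration"])

print(report)
print(clean.describe())
```

Dropping incomplete rows is only one option; imputing values may be more appropriate depending on what the exploration shows.
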
Feature Engineering (Crucial):
- Calculate distance between pickup and dropoff locations.
- Create features representing time-based components (e.g., hour, day of week, month).
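
One possible way to build these features is sketched below. It assumes the great-circle (haversine) distance is an acceptable stand-in for the distance actually driven, which the brief does not specify, and the names `haversine_km`, `add_features`, and the new column names are illustrative choices.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_features(df):
    """Return a copy of df with distance and time-based columns added."""
    out = df.copy()
    out["distance_km"] = haversine_km(
        out["pickup_latitude"],
        out["pickup_longitude"],
        out["dropoff_latitude"],
        out["dropoff_longitude"],
    )
    pickup = pd.to_datetime(out["pickup_datetime"])
    out["pickup_hour"] = pickup.dt.hour
    out["pickup_dayofweek"] = pickup.dt.dayofweek  # Monday = 0, Sunday = 6
    out["pickup_month"] = pickup.dt.month
    return out


# Made-up single trip, used only to show the call.
example = pd.DataFrame(
    {
        "pickup_datetime": ["2016-03-14 17:24:55"],
        "pickup_latitude": [40.768],
        "pickup_longitude": [-73.982],
        "dropoff_latitude": [40.766],
        "dropoff_longitude": [-73.965],
    }
)
features = add_features(example)
print(features[["distance_km", "pickup_hour", "pickup_dayofweek", "pickup_month"]])
```

Only pickup-time features are derived here because the dropoff time is not known when a prediction has to be made.
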
Model Building:
- Split the data into training and validation sets.
- Choose a suitable regression algorithm (e.g., Linear Regression, Random Forest, Gradient Boosting).
- Train the model on the training data.
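
A minimal sketch of the split-and-train step, using scikit-learn's `train_test_split` and `RandomForestRegressor`. The data here is synthetic, with an invented relationship between distance and duration, so the snippet runs on its own; the feature list, the 80/20 split, and the hyperparameters are assumptions, not recommendations from the brief.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumed feature columns, e.g. produced during feature engineering.
FEATURES = ["distance_km", "pickup_hour", "pickup_dayofweek", "passenger_count"]


def make_synthetic_trips(n=500, seed=0):
    """Generate made-up trips so the example does not need the real file."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "distance_km": rng.uniform(0.5, 20.0, n),
            "pickup_hour": rng.integers(0, 24, n),
            "pickup_dayofweek": rng.integers(0, 7, n),
            "passenger_count": rng.integers(1, 5, n),
        }
    )
    # Invented rule: about 3 minutes per km plus noise, never below 1 second.
    duration = 120 + 180 * df["distance_km"] + rng.normal(0, 60, n)
    df["trip_duration"] = duration.clip(lower=1)
    return df


def fit_model(df, seed=0):
    """Hold out 20% of the rows for validation and fit a random forest on the rest."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[FEATURES], df["trip_duration"], test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model, X_val, y_val


trips = make_synthetic_trips()
model, X_val, y_val = fit_model(trips)
print("validation R^2:", model.score(X_val, y_val))
```

Swapping in `LinearRegression` or `GradientBoostingRegressor` only changes the line that constructs the model.
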
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as RMSE and MAE. Consider applying a transformation (for example, a logarithm) to the target column, since trip durations tend to be right-skewed.
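
The evaluation step, including a transformed target, might look like the sketch below. It assumes `log1p` as the transformation (the brief does not name one) and wraps the regressor in scikit-learn's `TransformedTargetRegressor`, so predictions come back in seconds. The `evaluate` helper and the synthetic data are illustrative; RMSLE is reported alongside RMSE and MAE because it pairs naturally with a log-transformed target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(y_true, y_pred):
    """Compute RMSE, MAE, and RMSLE for durations given in seconds."""
    y_true = np.asarray(y_true, dtype=float)
    # Negative durations are not meaningful and would break log1p.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmsle": float(
            np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))
        ),
    }


def make_synthetic(n=400, seed=0):
    """Made-up data: duration grows with distance, with multiplicative noise."""
    rng = np.random.default_rng(seed)
    distance_km = rng.uniform(0.5, 20.0, n)
    duration = 200.0 * distance_km**0.8 * np.exp(rng.normal(0, 0.2, n))
    X = np.log(distance_km).reshape(-1, 1)  # single feature: log distance
    return X, duration


X, y = make_synthetic()
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The regressor is fitted on log1p(y); predict() applies expm1 automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)

scores = evaluate(y_val, model.predict(X_val))
print(scores)
```

Comparing these scores with those of the same regressor trained on the untransformed target is a quick way to check whether the transformation helps.
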
Prediction and Submission:
- Make predictions on the test data.
- Format the submission file according to the competition guidelines.
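
A sketch of writing the submission file. It assumes the competition expects exactly two columns, `id` and `trip_duration`; confirm this against the competition guidelines before submitting. The helper `make_submission` and the example values are hypothetical.

```python
import io

import numpy as np
import pandas as pd


def make_submission(ids, predictions):
    """Pair each test-set id with its predicted duration in seconds."""
    preds = np.asarray(predictions, dtype=float)
    if len(ids) != len(preds):
        raise ValueError("ids and predictions must have the same length")
    # Negative predictions are clipped to zero, since durations cannot be negative.
    preds = np.clip(preds, 0, None)
    return pd.DataFrame({"id": list(ids), "trip_duration": preds})


# Made-up ids and predictions, used only to show the call.
submission = make_submission(["id0000001", "id0000002"], [455.2, -3.0])

# Written to an in-memory buffer here; pass a filename such as
# "submission.csv" to to_csv() to create the actual file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```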

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.