Loan Approval Prediction

1. Problem Statement

You have taken a data scientist role at a major microfinance institution, where you work as a credit analyst in charge of loan approval for its customers. You quickly notice that the institution's existing methods for classifying loan applications fail often: many approved loans are never repaid, causing serious financial problems for the institution and its clients. The CEO knows that change is necessary and has promised to devote resources to improving the process.

As a data scientist, you know that you can use machine learning to design a model that classifies loans more reliably. By leveraging features such as credit history, income, employment length, and loan intent, you can predict which applicants are most likely to repay their loans. This improved classification system will not only reduce the institution's financial risk but also enable it to extend credit responsibly to underserved communities, fulfilling its mission of financial inclusion.

Goal: Build a model that accurately predicts loan approvals.

2. Data Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Loan Approval Prediction dataset.

Rows: 58646
Columns: 13
Variables:
- id: Unique identifier
- person_age: Person age
- person_income: Person income
- person_home_ownership: Person home ownership
- person_emp_length: Person employment length
- loan_intent: Loan intent
- loan_grade: Loan grade
- loan_amnt: Loan amount
- loan_int_rate: Loan interest rate
- loan_percent_income: Loan percent income
- cb_person_default_on_file: Whether the person has a default on file with the credit bureau (CB)
- cb_person_cred_hist_length: Length of the person's credit history on file with the credit bureau (CB)
- loan_status: Loan status (target variable)

Data Source: Kaggle Playground Series S4E10

3. Your Task

Your task is to build a classification model to predict loan approvals.

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Convert categorical variables to numerical format using techniques like one-hot encoding.
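
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed pipeline: the column names come from the data description, but the file name `train.csv`, the category values in the toy frame, and the choice of median and most-frequent imputation are assumptions.

```python
import pandas as pd

# Categorical columns as listed in the data description.
CATEGORICAL_COLS = [
    "person_home_ownership",
    "loan_intent",
    "loan_grade",
    "cb_person_default_on_file",
]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values and one-hot encode the categorical columns."""
    out = df.copy()
    cat_cols = [c for c in CATEGORICAL_COLS if c in out.columns]
    num_cols = [
        c for c in out.select_dtypes(include="number").columns
        if c not in ("id", "loan_status")
    ]

    # Assumed strategy: median for numeric gaps, most frequent value for categorical gaps.
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # One-hot encode; each category becomes its own 0/1 column.
    return pd.get_dummies(out, columns=cat_cols, dtype=int)


# Tiny made-up frame for illustration. With the real data this would be
# something like: df = pd.read_csv("train.csv")  (file name assumed).
sample = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "person_age": [22, 35, None, 41],
    "person_income": [35000, 72000, 51000, 60000],
    "person_home_ownership": ["RENT", "OWN", "RENT", None],
    "loan_intent": ["EDUCATION", "MEDICAL", "PERSONAL", "MEDICAL"],
    "loan_status": [0, 0, 1, 0],
})
encoded = preprocess(sample)
print(encoded.isna().sum().sum())  # total number of remaining missing values
print(sorted(encoded.columns))
```

When a validation split is used, it is usually safer to compute the imputation statistics on the training portion only, so that no information from the validation rows leaks into training.
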
Feature Engineering (Optional):
- Create new features that may improve model performance.
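
As one possible illustration of feature engineering, the sketch below derives a few ratio and log features from the columns in the data description. The feature names (`est_interest_cost`, `interest_to_income`, `emp_length_to_age`, `log_income`) are hypothetical, and the code assumes `loan_int_rate` is expressed as a percentage; whether any of these features help should be checked on validation data.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few hypothetical derived features to a copy of df."""
    out = df.copy()
    # Rough yearly interest cost (assumes loan_int_rate is a percentage).
    out["est_interest_cost"] = out["loan_amnt"] * out["loan_int_rate"] / 100.0
    # Interest cost relative to income.
    out["interest_to_income"] = out["est_interest_cost"] / out["person_income"]
    # Employment length relative to age.
    out["emp_length_to_age"] = out["person_emp_length"] / out["person_age"]
    # Log-transformed income to reduce the influence of very large values.
    out["log_income"] = np.log1p(out["person_income"])
    return out


# Made-up example row.
example = pd.DataFrame({
    "person_age": [25],
    "person_income": [50000],
    "person_emp_length": [5.0],
    "loan_amnt": [10000],
    "loan_int_rate": [10.0],
})
features = add_features(example)
print(features.iloc[0].to_dict())
```
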
Model Building:
- Split the data into training and validation sets.
- Choose a suitable classification algorithm (e.g., Logistic Regression, Random Forest, Gradient Boosting).
- Train the model on the training data.
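
The split-and-train steps might look like the sketch below. It uses synthetic data in place of the real training file, and the 80/20 split, the stratification, the Random Forest settings, and the rule that generates the synthetic target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training features.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "person_income": rng.integers(20_000, 120_000, size=n),
    "loan_amnt": rng.integers(1_000, 30_000, size=n),
    "loan_int_rate": rng.uniform(5.0, 20.0, size=n),
})
# Made-up target: 1 when the loan is large relative to income.
y = ((X["loan_amnt"] / X["person_income"]) > 0.25).astype(int)

# Hold out 20% for validation; stratify keeps the class balance similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Logistic Regression or Gradient Boosting can be swapped in by replacing the estimator; the split and `fit` calls stay the same.
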
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as accuracy, precision, recall, F1-score, and AUC-ROC.
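
The metrics listed above can be computed with scikit-learn as in the sketch below. The helper name `evaluate` and the 0.5 decision threshold are assumptions; the labels and probabilities are a hand-made example rather than real model output.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Return common classification metrics for predicted probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Turn probabilities into hard 0/1 predictions at the chosen threshold.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # AUC-ROC is computed from the probabilities, not the thresholded labels.
        "auc_roc": roc_auc_score(y_true, y_prob),
    }


metrics = evaluate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a fitted scikit-learn classifier, the probabilities would typically come from `model.predict_proba(X_val)[:, 1]`. If the classes are imbalanced, accuracy alone can be misleading, which is one reason to report precision, recall, and AUC-ROC alongside it.
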
Prediction and Submission:
- Make predictions on the test data.
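
A rough sketch of the prediction step is shown below. The helper `make_submission`, the two-column layout (`id`, `loan_status`), the use of predicted probabilities rather than hard labels, and the output file name are all assumptions; check the competition's submission requirements before relying on them. The training and test frames are toy data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_submission(model, test_df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Predict the probability of loan_status == 1 for each test row."""
    proba = model.predict_proba(test_df[feature_cols])[:, 1]
    return pd.DataFrame({"id": test_df["id"].to_numpy(), "loan_status": proba})


# Toy training data: higher loan_percent_income goes with loan_status == 1.
train = pd.DataFrame({
    "loan_percent_income": [0.05, 0.10, 0.15, 0.20, 0.40, 0.50, 0.55, 0.60],
    "loan_status": [0, 0, 0, 0, 1, 1, 1, 1],
})
test = pd.DataFrame({"id": [100, 101], "loan_percent_income": [0.10, 0.60]})

feature_cols = ["loan_percent_income"]
model = LogisticRegression()
model.fit(train[feature_cols], train["loan_status"])

submission = make_submission(model, test, feature_cols)
print(submission)
# To write the file: submission.to_csv("submission.csv", index=False)  (file name assumed)
```

The test data must go through the same preprocessing and feature engineering as the training data, with the same columns in the same order, before it is passed to the model.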

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.