Steel Plate Defect Prediction

1. Problem Statement

Imagine you are an analyst at a steel manufacturing plant where quality control is a top priority. The plant produces steel plates for various industries, but defects occasionally occur, leading to production delays, increased material waste, and compromised product quality. Identifying these defects early in the manufacturing process is critical to maintaining customer satisfaction and reducing costs. The plant manager is seeking a data-driven solution to improve the plant's defect detection capabilities.

You have been given a dataset with features related to the shape, size, and other characteristics of steel plates, along with labels indicating the presence and type of defect. Your task is to build a multi-label classification model that can accurately predict the presence of each defect type. By leveraging machine learning techniques, you will help streamline the manufacturing process, reduce material waste, and improve overall product quality, leading to increased customer satisfaction and cost savings. Your model will play a key role in ensuring that only high-quality steel plates are shipped to customers, reinforcing the plant's reputation for excellence.

Goal: Build a classification model that accurately predicts the presence of each defect type in steel plates.

2. Data Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Steel Plates Faults dataset from UCI. Feature distributions are close to, but not exactly the same as, those of the original.

Rows: 19220
Columns: 35
Variables:
- id: Unique identifier
- X_Minimum: Minimum X coordinate
- X_Maximum: Maximum X coordinate
- Y_Minimum: Minimum Y coordinate
- Y_Maximum: Maximum Y coordinate
- Pixels_Areas: Area of pixels
- X_Perimeter: X Perimeter
- Y_Perimeter: Y Perimeter
- Sum_of_Luminosity: Sum of luminosity
- Minimum_of_Luminosity: Minimum of luminosity
- Maximum_of_Luminosity: Maximum of luminosity
- Length_of_Conveyer: Length of conveyer
- TypeOfSteel_A300: Type of steel A300
- TypeOfSteel_A400: Type of steel A400
- Steel_Plate_Thickness: Steel plate thickness
- Edges_Index: Edges index
- Empty_Index: Empty index
- Square_Index: Square index
- Outside_X_Index: Outside X index
- Edges_X_Index: Edges X index
- Edges_Y_Index: Edges Y index
- Outside_Global_Index: Outside global index
- LogOfAreas: Log of areas
- Log_X_Index: Log of X index
- Log_Y_Index: Log of Y index
- Orientation_Index: Orientation index
- Luminosity_Index: Luminosity index
- SigmoidOfAreas: Sigmoid of areas
- Pastry: Pastry (target variable)
- Z_Scratch: Z Scratch (target variable)
- K_Scatch: K Scratch (target variable; note the column name is spelled K_Scatch)
- Stains: Stains (target variable)
- Dirtiness: Dirtiness (target variable)
- Bumps: Bumps (target variable)
- Other_Faults: Other Faults (target variable)

Data Source: Kaggle Playground Series S4E3


3. Your Task

Build a multi-label classification model that predicts the presence of each of the seven defect types. A suggested workflow:

Data Exploration and Preprocessing:
- Load the dataset using Pandas.
- Explore the data to understand the distribution of features.
- Handle missing values, if any.
- Scale numerical features so that those with large numeric ranges do not dominate scale-sensitive models.
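
The preprocessing steps above could be sketched roughly as follows. This is only an illustration under assumptions: the file name train.csv, the helper names split_features_targets and impute_and_scale, and the choice of median imputation with StandardScaler are not prescribed by the brief, and a tiny synthetic frame stands in for the real data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The seven target columns listed in the data description.
TARGETS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
           "Dirtiness", "Bumps", "Other_Faults"]

def split_features_targets(df):
    """Separate the feature columns from the id and defect-label columns."""
    present = [c for c in TARGETS if c in df.columns]
    X = df.drop(columns=["id"] + present, errors="ignore")
    y = df[present]
    return X, y

def impute_and_scale(X):
    """Fill missing values with column medians, then standardize each column."""
    X = X.fillna(X.median(numeric_only=True))
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X),
                            columns=X.columns, index=X.index)
    return X_scaled, scaler

# With the real data this would be something like:
# df = pd.read_csv("train.csv")  # assumed file name
# Synthetic stand-in with two features and two targets:
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": range(6),
    "X_Minimum": rng.integers(0, 1700, 6).astype(float),
    "Pixels_Areas": rng.integers(10, 5000, 6).astype(float),
    "Pastry": [0, 1, 0, 0, 1, 0],
    "Bumps": [1, 0, 0, 1, 0, 0],
})
demo.loc[2, "Pixels_Areas"] = np.nan  # inject one missing value

print(demo.isna().sum())              # missing values per column
X, y = split_features_targets(demo)
X_scaled, scaler = impute_and_scale(X)
print(X_scaled.describe().loc[["mean", "std"]])
```

In a real run, the scaler would normally be fitted on the training split only and then applied to the validation and test data. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step matters mainly for linear or distance-based models.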
Feature Engineering (Optional):
- Create new features that may improve model performance.
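
As one possible illustration of feature engineering, the sketch below derives a few quantities from the coordinate and luminosity columns. The new feature names (X_Range, Y_Range, Mean_Luminosity) and the function add_geometry_features are hypothetical; whether such features actually help would need to be checked on validation data.

```python
import pandas as pd

def add_geometry_features(df):
    """Return a copy of df with a few derived columns (illustrative only)."""
    out = df.copy()
    # Extent of the defect's bounding box along each axis.
    out["X_Range"] = out["X_Maximum"] - out["X_Minimum"]
    out["Y_Range"] = out["Y_Maximum"] - out["Y_Minimum"]
    # Average luminosity per pixel; clip the area to avoid division by zero.
    out["Mean_Luminosity"] = (
        out["Sum_of_Luminosity"] / out["Pixels_Areas"].clip(lower=1)
    )
    return out

# Small hand-made example using column names from the data description.
demo = pd.DataFrame({
    "X_Minimum": [10, 200], "X_Maximum": [50, 260],
    "Y_Minimum": [1000, 5000], "Y_Maximum": [1040, 5010],
    "Pixels_Areas": [400, 0], "Sum_of_Luminosity": [48000, 0],
})
engineered = add_geometry_features(demo)
print(engineered[["X_Range", "Y_Range", "Mean_Luminosity"]])
```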
Model Building:
- Split the data into training and validation sets.
- Choose a suitable multi-label classification algorithm (e.g., Random Forest, Gradient Boosting).
- Train the model on the training data.
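
A minimal sketch of the model-building steps, assuming scikit-learn and using synthetic data in place of the steel plate features (the array sizes, split ratio, and hyperparameters are arbitrary choices for illustration). RandomForestClassifier accepts a two-dimensional label matrix directly, while GradientBoostingClassifier handles one label at a time and is therefore wrapped in MultiOutputClassifier here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in: 10 features, 7 binary labels driven by linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
W = rng.normal(size=(10, 7))
y = ((X @ W + rng.normal(scale=0.5, size=(500, 7))) > 0).astype(int)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Random Forest supports multi-label targets natively.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Gradient Boosting is fitted once per label via the wrapper.
gb = MultiOutputClassifier(
    GradientBoostingClassifier(n_estimators=50, random_state=0)
)
gb.fit(X_train, y_train)

rf_pred = rf.predict(X_val)  # shape: (n_validation_rows, 7)
gb_pred = gb.predict(X_val)
print(rf_pred.shape, gb_pred.shape)
```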
Model Evaluation:
- Evaluate the model's performance on the validation data.
- Use metrics such as accuracy, precision, recall, F1-score, and AUC-ROC, computed per defect type and then averaged.
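
The evaluation step might look like the sketch below, again on synthetic data with arbitrary settings. It computes AUC-ROC for each label from predicted probabilities and averages the results, then reports macro-averaged precision, recall, and F1 at a 0.5 threshold (the threshold is an assumption, not part of the brief). Note that scikit-learn's accuracy_score on a multi-label matrix is subset accuracy: a row counts as correct only if all of its labels match.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, 7 binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
W = rng.normal(size=(10, 7))
y = ((X @ W + rng.normal(scale=0.5, size=(600, 7))) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For multi-label targets, predict_proba returns one (n_samples, 2) array
# per label; keep the probability of the positive class for each.
proba = np.column_stack([p[:, 1] for p in model.predict_proba(X_val)])
pred = (proba >= 0.5).astype(int)  # assumed decision threshold

per_label_auc = [roc_auc_score(y_val[:, j], proba[:, j])
                 for j in range(y_val.shape[1])]
mean_auc = float(np.mean(per_label_auc))

macro_precision = precision_score(y_val, pred, average="macro", zero_division=0)
macro_recall = recall_score(y_val, pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_val, pred, average="macro", zero_division=0)
subset_accuracy = accuracy_score(y_val, pred)  # exact match on all 7 labels

print("Mean AUC-ROC:", round(mean_auc, 3))
print("Macro precision / recall / F1:",
      round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))
print("Subset accuracy:", round(subset_accuracy, 3))
```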
Prediction and Submission:
- Make predictions on the test data.
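
For the final step, one plausible layout is a table with the id column followed by one predicted probability per defect type. The brief does not specify a submission format, so the column layout, the file name submission.csv, and the test ids used below are assumptions for illustration; the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TARGETS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
           "Dirtiness", "Bumps", "Other_Faults"]

# Synthetic stand-ins for the training set and the unlabeled test set.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 7))
X_train = rng.normal(size=(300, 5))
y_train = ((X_train @ W + rng.normal(scale=0.5, size=(300, 7))) > 0).astype(int)
X_test = rng.normal(size=(50, 5))
test_ids = np.arange(1000, 1050)  # hypothetical ids

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# One positive-class probability per defect type and test row.
proba = np.column_stack([p[:, 1] for p in model.predict_proba(X_test)])

submission = pd.DataFrame(proba, columns=TARGETS)
submission.insert(0, "id", test_ids)
print(submission.head())

# To save the predictions (assumed file name):
# submission.to_csv("submission.csv", index=False)
```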

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn.