Take Home Exam: Exploring the Depths of Tree-Based Ensemble Models
1. Introduction
In this project, we delve into the intricate world of tree-based ensemble models, using the renowned Iris dataset—a cornerstone in machine learning for classification tasks. The Iris dataset, easily accessible from the scikit-learn library, comprises 150 samples of iris flowers across three species (setosa, versicolor, and virginica), with four features describing each sample: sepal length, sepal width, petal length, and petal width.
Your task is to navigate through the complexities of ensemble modeling to predict the species of iris flowers based on these features. This endeavor will not only deepen your understanding of tree-based models and their ensembles but also sharpen your skills in manipulating data, engineering features, and critically evaluating model performance.
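As a starting point, the dataset described above can be loaded directly from scikit-learn; the snippet below simply confirms the shapes and class names you will be working with:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                   # (150, 4): 150 samples, 4 features
print(iris.feature_names)        # sepal/petal length and width (cm)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
```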
2. Objective
You are tasked with constructing a Gradient Boosting model from scratch to tackle this classification problem. Your model should incorporate key components of gradient boosting algorithms, such as decision trees as weak learners, a loss function to be minimized, and a mechanism for adding trees to the model in a way that reduces the overall prediction error.
Specific Requirements:
- Model Architecture: Utilize decision trees as the base learners. Start with simple trees and iteratively increase complexity as needed.
- Learning Process: Implement gradient descent on the loss function: at each boosting round, fit a new tree to the negative gradient of the loss (the residual errors of the current ensemble) and add it to the model so that it corrects the errors made by the existing trees.
- Hyperparameters: Manually configure and adjust at least three hyperparameters (e.g., learning rate, number of trees, and max depth of the trees) to observe their impact on model performance.
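To make the requirements above concrete, here is one minimal sketch of a from-scratch multiclass gradient booster, not a reference solution. The class name `SimpleGradientBoostingClassifier` is invented for illustration; it uses scikit-learn regression trees as weak learners, softmax cross-entropy as the loss, and exposes the three hyperparameters named above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


class SimpleGradientBoostingClassifier:
    """Illustrative sketch (hypothetical class): multiclass gradient boosting.

    Each round fits one regression tree per class to the negative gradient
    of the softmax cross-entropy loss, which is simply (one-hot target - p).
    """

    def __init__(self, n_estimators=50, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []  # one list of per-class trees per boosting round

    def _softmax(self, F):
        e = np.exp(F - F.max(axis=1, keepdims=True))  # numerically stable
        return e / e.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        self.n_classes = len(np.unique(y))
        Y = np.eye(self.n_classes)[y]                  # one-hot targets
        F = np.zeros((X.shape[0], self.n_classes))     # raw (logit) scores
        for _ in range(self.n_estimators):
            P = self._softmax(F)
            round_trees = []
            for k in range(self.n_classes):
                residual = Y[:, k] - P[:, k]           # negative gradient
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, residual)
                F[:, k] += self.learning_rate * tree.predict(X)
                round_trees.append(tree)
            self.trees.append(round_trees)
        return self

    def predict(self, X):
        F = np.zeros((X.shape[0], self.n_classes))
        for round_trees in self.trees:
            for k, tree in enumerate(round_trees):
                F[:, k] += self.learning_rate * tree.predict(X)
        return F.argmax(axis=1)


X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
model = SimpleGradientBoostingClassifier().fit(X_tr, y_tr)
acc = (model.predict(X_te) == y_te).mean()
```

Your own implementation may structure the learners differently (e.g., a single tree per round for a binary reduction), as long as the tree-adding mechanism demonstrably minimizes the loss.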
3. Evaluation
Your project will be evaluated based on the following criteria, totaling 100 points:
- Model Development (30 Points): This includes correctly implementing the gradient boosting model, ensuring it can be trained and used for predictions on the Iris dataset.
- Data Preprocessing (10 Points): Proper handling of the dataset, including splitting into training and testing sets, and any necessary preprocessing steps.
- Hyperparameter Tuning (20 Points): Demonstration of understanding and application of hyperparameter tuning, including the selection and adjustment of at least three hyperparameters and their impact on the model.
- Model Evaluation (20 Points): Use of appropriate metrics (e.g., accuracy, precision, recall, F1 score) to evaluate model performance on the testing set, including a comparison of results before and after hyperparameter tuning.
- Report and Interpretation (20 Points): A comprehensive report detailing your methodology, findings, the effect of hyperparameter tuning on the model, and insights on how tree-based ensemble models work.
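For the evaluation criteria above, scikit-learn's metric functions can compute all four scores at once. The sketch below uses the library's built-in `GradientBoostingClassifier` purely as a stand-in for your from-scratch model; swap in your own `predict` output when reporting results:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Stand-in model; replace with your from-scratch implementation.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

accuracy = accuracy_score(y_te, y_pred)
# Macro-averaging treats the three species equally regardless of support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_te, y_pred, average="macro")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```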
Extra Mission - Expert Question (40 Points):
- Feature Importance Analysis: Extend your model to include an analysis of feature importance. Implement a method to calculate and visualize the importance of each feature in the dataset with respect to the model's predictive power. This requires a deep understanding of the model's internals and the ability to extract and interpret the contribution of each feature.
Note: The extra mission points are additive, raising the maximum possible score to 140 points. This challenge is designed to push your understanding and capabilities to the expert level, focusing on one of the key aspects that make tree-based models particularly valuable for practical machine learning tasks.
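One common way to approach the extra mission is impurity-based importance: average, over every tree in the ensemble, the total impurity decrease attributed to each feature, then normalize. The sketch below builds a toy residual-fitting ensemble just to have trees to aggregate; in your submission you would reuse the trees from your own boosting model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

iris = load_iris()
X, y = iris.data, iris.target.astype(float)

# Toy boosting loop on squared error, only to produce an ensemble of trees;
# a real submission would reuse the trees fit by the from-scratch model.
trees, F, lr = [], np.zeros(len(y)), 0.1
for _ in range(20):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
    F += lr * tree.predict(X)
    trees.append(tree)

# Importance = per-tree impurity decrease per feature, averaged over the
# ensemble and normalized to sum to 1.
importance = np.mean([t.feature_importances_ for t in trees], axis=0)
importance /= importance.sum()
for name, imp in zip(iris.feature_names, importance):
    print(f"{name}: {imp:.3f}")
```

A bar chart of `importance` against `iris.feature_names` would satisfy the visualization part; on Iris you should find the petal measurements dominating, which is a useful sanity check for your analysis.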