What is XGBoost?
- Optimized gradient-boosting machine learning library
- Originally ritten in C++
Has API in several languages:
What n=makes XGBoost so popular?
- Speed and Perfomance
- Core algorithm is parallelizable across CPU and GPU cores and also multuple GPUs across networks of computers.
Create a Bootstrapped Dataset.
- Same size as the original.
- We are allowed to pick the same sample more than once.
Create a Decision Tree using the bootstraped dataset, but only use a random subset of variables (or columns) at each step.
- Let's say you have [age, height, weight, gender], you only use 2 variable out of 4.
- There is a better way to find the optimal value (using Grid Search)
Split the Root Node (randomly select 2 variables, say [age, weight]).
- Use Gini/Entrophy to choose which one gives the proper split.
- And we build the tree as usual, but only considering a random subset variables (2 in our case) at each step.
Now goto step 1 and repeat.
- Make a new bootstraped dataset.
- Build a tree considering a subset of variables at each step.
- YOu build 100's of trees.
Now we have the trees, how do we use it?
Take a new data and pass it through all the trees.
- 80 trees yes, 20 trees no
- Yes received the most votes, so we will conclude Yes as the output.
Bootstraping the data + aggregate results is called Bagging
How do we evaluate it?
Remember we allowed duplicate entries in the bootstraped dataset, so there will be few data which is not included. Typically, about 1/3 of the original data does not end up in the dootstraped dataset. This is called the Out-of-Bag Dataset
- We run this Out-of-Bag sample through all of the trees that were built without it.
- Do the same for all Out-of-Bag samples for all of the trees.
- The propotion of Out-of-Bag samples that were incorrectly classified is the Out-of-Bag Error.
OK, we now know how to:
- Build a Random Forest
- Use a Random Forest
- Estimate the accuracy of a RAndom Forest
Now lets go to step 1 to improve the model
- Remember when we built the tree, we only used 2 Variables to make a decision at each step.
- Now we update to value to 3 (total is 3) calculate the Out-of-Bag Error. (We cannot use 1 or all (4) for making decision)
- We test a bunch of different settings and choose the most accurate random forest.
from sklearn.ensemble import RandomForestClassifier clf = RandomForestClassifier(n_estimators=50, criterion="gini", max_depth=4, min_samples_split=3, random_state=0)
AdaBoost with Decision Trees and Random Forests Main Differences
The Trees are usually just a node and two leaves (calles stumps).
- Forest of Stumps.
- Stumps are technically weak learners.
- In Random Forest, each trees are different in size and depth.
Some stumps get more say in the final classification that others.
- In Random Forest, each tree has an equal vote on the final classification.
Trees are made in order, the error that the first stump makes influence how the second stump is made, etc.
- In Random Forest each trees is made independeltly of the others.
AdaBoost Model Building Steps
Take a dataset (8 samples for example) and assign the sample weight to each of the sample.
- All samples get the same weight and that makes all samples equally important.
- Sample weight is 1/total No. of samples => 1/8
Build the stumps and calculate the Gini Index
- The stump with the lower gini index will be the first stump.
Calculate the Amount of Say each stump has.
- Amount of say = 1/2 * log((1 - Total Error)/Total Error)
- Say (Positive, Neutral, Negative)
Increase/Decrease Sample Weight.
- For Incorrectly Classified (Increase Weight) => New Sample Weight = Sample weight * e^amount of say
- For Correctly Classified (Decrease Weight) => New Sample Weight = Sample weight * e^-amount of say
- Now we normalize the sample weights, so they add upto 1.
- Now update the sample weight with the new sample weight.
Now we create a New dataset of same size, but will make sure the new collection contains samples with larger Sample Weights (Allow Duplicates).
- Now that the error made by the previous stump influence how the next stump is built.
How to make Classification?
- Pass the data through all the stumps and do the prediction (yes or no).
- Get the amount of say for each stump.
- And classified based on the total sum of the say.
Gradient Boost in comparision with AdaBoost
- Builds Stumps (Node with 2 leaves)
- Calculates the Amount of say.
- Created new data set with more weights.
- Repeates untill no. of defined stumps created.
- Finally Classification.
Starts by making a single leaf, instead of tree or stump.
- For regression, it is the Average Value.
Then the Gradient Boost Builds a tree.
- This tree is based on the error made by the previous tree
- Unlike AdaBoost this tree is usually larger than a stump. (GB still restricts the size of the tree).
- Max No. of tree often used is between 8 and 32.
- And finally it scales the tree by equal amount.
GradientBoost Model Building Steps
- Single leaf (Average Value)
- Calculate the residuals
Build a tree to predict the residuals.
- Data would end up in different leaves
- Average if more than one residuals end up in a leaf.
- Now we combine the original leaf with the new tree to make a new prediction.
- New Prediction = Avg. Weight + (LR * Tree Prediction)
Now go to step 2 and repeat
- Each time new tree is built.
- Now combine the new tree with previous tree and initial weight.
- Every tree is scaled by Learning Rate.
- 0.5 as initial prediction (Reg or Class)
- Calculate the Residuals
Build a regression to fit the residuals.
- Unlike GB, XGB builds a unique regression tree
- Each tree starts as a single leaf and all the residuals go the leaf.
- Calculate the Quality Score or Similarity Score for the Residuals.
- Similarity Score = (Sum of R)^2 / (No. of R + lambda) lambda -> Regularization Term
- REsiduals cancel each other, and lambda = 0, similarity = 4 for 1 leaf
Further split the node Threshold (Threshold is the Average of observations)
- Calculate the similarity score.
- Calculate Gain => LeftSim + RightSim - RootSim
Now calculate the Gain for the other thresholds.
- Build the tree
- Calculate similarity score for each leaf
- And calculate Gain
Compare all the Gain
- Gain -> 120, 14, 56
- So we will use the 120 (dosage < 15) for the first branch in the tree.
Iterate the same process for the leaves with more then one residuals.
- Step 5
- Calculate Similarity Scores and Gain.
- You can further increase the levels but the default is to allow up to 6 levels.
Prune the Tree
- Take a value galled Gamma
- Gain - Gamma, if the value is negative we prune the tree and if positive, not remove the branch.
- If the leaf node gives +ve value and root node gives -ve value, we do not remove the root.
- Sometimes we will end up with removing the entire tree itself. (Extreme Pruning)
Now we again build the tree, this time we change the lambda(Reg. Term)
- Lambda = 1, to prevent overfitting
- Setting lambda 0, does not turn off pruning.
Calculate the output value for each leaves.
- Sum of R / (No. of R + lambda) [similar to similarity score, except we do not square]
- Calculate output value for all trees
- Our first tree is complete
Make New Prediction
- Same as GB
- 0.5 + LR X Output Value
- Here LR is known as eta (e), default = 0.3
- New prediction will be moving closer to the values giving low residual values.
Now build a new tree based on the new residuals.
- Goto step 4
- we calculate similarity scores and gain to determing how to split the data
we prune tree by calculating the diffrence between Gain values and Gamma (y)
- -ve prune tree
- +ve no prune
- Calculate output values