Ensemble Techniques
What is XGBoost?
 Optimized gradientboosting machine learning library
 Originally ritten in C++

Has API in several languages:
 Python
 R
 Scala
 Julia
 Java
What n=makes XGBoost so popular?
 Speed and Perfomance
 Core algorithm is parallelizable across CPU and GPU cores and also multuple GPUs across networks of computers.
Random Forest

Create a Bootstrapped Dataset.
 Same size as the original.
 We are allowed to pick the same sample more than once.

Create a Decision Tree using the bootstraped dataset, but only use a random subset of variables (or columns) at each step.
 Let's say you have [age, height, weight, gender], you only use 2 variable out of 4.
 There is a better way to find the optimal value (using Grid Search)

Split the Root Node (randomly select 2 variables, say [age, weight]).
 Use Gini/Entrophy to choose which one gives the proper split.
 And we build the tree as usual, but only considering a random subset variables (2 in our case) at each step.

Now goto step 1 and repeat.
 Make a new bootstraped dataset.
 Build a tree considering a subset of variables at each step.
 YOu build 100's of trees.
Now we have the trees, how do we use it?

Take a new data and pass it through all the trees.
 80 trees yes, 20 trees no
 Yes received the most votes, so we will conclude Yes as the output.
Bootstraping the data + aggregate results is called Bagging
How do we evaluate it?
Remember we allowed duplicate entries in the bootstraped dataset, so there will be few data which is not included. Typically, about 1/3 of the original data does not end up in the dootstraped dataset. This is called the OutofBag Dataset
 We run this OutofBag sample through all of the trees that were built without it.
 Do the same for all OutofBag samples for all of the trees.
 The propotion of OutofBag samples that were incorrectly classified is the OutofBag Error.
OK, we now know how to:
 Build a Random Forest
 Use a Random Forest
 Estimate the accuracy of a RAndom Forest

Now lets go to step 1 to improve the model
 Remember when we built the tree, we only used 2 Variables to make a decision at each step.
 Now we update to value to 3 (total is 3) calculate the OutofBag Error. (We cannot use 1 or all (4) for making decision)
 We test a bunch of different settings and choose the most accurate random forest.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, criterion="gini", max_depth=4, min_samples_split=3, random_state=0)
More Here
sklearn.ensemble.RandomForestClassifier
AdaBoost
AdaBoost with Decision Trees and Random Forests Main Differences

The Trees are usually just a node and two leaves (calles stumps).
 Forest of Stumps.
 Stumps are technically weak learners.
 In Random Forest, each trees are different in size and depth.

Some stumps get more say in the final classification that others.
 In Random Forest, each tree has an equal vote on the final classification.

Trees are made in order, the error that the first stump makes influence how the second stump is made, etc.
 In Random Forest each trees is made independeltly of the others.
AdaBoost Model Building Steps

Take a dataset (8 samples for example) and assign the sample weight to each of the sample.
 All samples get the same weight and that makes all samples equally important.
 Sample weight is 1/total No. of samples => 1/8

Build the stumps and calculate the Gini Index
 The stump with the lower gini index will be the first stump.

Calculate the Amount of Say each stump has.
 Amount of say = 1/2 * log((1  Total Error)/Total Error)
 Say (Positive, Neutral, Negative)

Increase/Decrease Sample Weight.
 For Incorrectly Classified (Increase Weight) => New Sample Weight = Sample weight * e^amount of say
 For Correctly Classified (Decrease Weight) => New Sample Weight = Sample weight * e^amount of say
 Now we normalize the sample weights, so they add upto 1.
 Now update the sample weight with the new sample weight.

Now we create a New dataset of same size, but will make sure the new collection contains samples with larger Sample Weights (Allow Duplicates).
 Now that the error made by the previous stump influence how the next stump is built.
How to make Classification?
 Pass the data through all the stumps and do the prediction (yes or no).
 Get the amount of say for each stump.
 And classified based on the total sum of the say.
Gradient Boost
Gradient Boost in comparision with AdaBoost
AdaBoost
 Builds Stumps (Node with 2 leaves)
 Calculates the Amount of say.
 Created new data set with more weights.
 Repeates untill no. of defined stumps created.
 Finally Classification.
Gradient Boost

Starts by making a single leaf, instead of tree or stump.
 For regression, it is the Average Value.

Then the Gradient Boost Builds a tree.
 This tree is based on the error made by the previous tree
 Unlike AdaBoost this tree is usually larger than a stump. (GB still restricts the size of the tree).
 Max No. of tree often used is between 8 and 32.
 And finally it scales the tree by equal amount.
GradientBoost Model Building Steps
 Single leaf (Average Value)
 Calculate the residuals

Build a tree to predict the residuals.
 Data would end up in different leaves
 Average if more than one residuals end up in a leaf.
 Now we combine the original leaf with the new tree to make a new prediction.
 New Prediction = Avg. Weight + (LR * Tree Prediction)

Now go to step 2 and repeat
 Each time new tree is built.
 Now combine the new tree with previous tree and initial weight.
 Every tree is scaled by Learning Rate.
XGBoost
Steps
 0.5 as initial prediction (Reg or Class)
 Calculate the Residuals

Build a regression to fit the residuals.
 Unlike GB, XGB builds a unique regression tree

XGBoost Trees
 Each tree starts as a single leaf and all the residuals go the leaf.
 Calculate the Quality Score or Similarity Score for the Residuals.
 Similarity Score = (Sum of R)^2 / (No. of R + lambda) lambda > Regularization Term
 REsiduals cancel each other, and lambda = 0, similarity = 4 for 1 leaf

Further split the node Threshold (Threshold is the Average of observations)
 Calculate the similarity score.
 Calculate Gain => LeftSim + RightSim  RootSim

Now calculate the Gain for the other thresholds.
 Build the tree
 Calculate similarity score for each leaf
 And calculate Gain

Compare all the Gain
 Gain > 120, 14, 56
 So we will use the 120 (dosage < 15) for the first branch in the tree.

Iterate the same process for the leaves with more then one residuals.
 Step 5
 Calculate Similarity Scores and Gain.
 You can further increase the levels but the default is to allow up to 6 levels.

Prune the Tree
 Take a value galled Gamma
 Gain  Gamma, if the value is negative we prune the tree and if positive, not remove the branch.
 If the leaf node gives +ve value and root node gives ve value, we do not remove the root.
 Sometimes we will end up with removing the entire tree itself. (Extreme Pruning)

Now we again build the tree, this time we change the lambda(Reg. Term)
 Lambda = 1, to prevent overfitting
 Setting lambda 0, does not turn off pruning.

Calculate the output value for each leaves.
 Sum of R / (No. of R + lambda) [similar to similarity score, except we do not square]
 Calculate output value for all trees
 Our first tree is complete

Make New Prediction
 Same as GB
 0.5 + LR X Output Value
 Here LR is known as eta (e), default = 0.3
 New prediction will be moving closer to the values giving low residual values.

Now build a new tree based on the new residuals.
 Goto step 4
Summary
 we calculate similarity scores and gain to determing how to split the data

we prune tree by calculating the diffrence between Gain values and Gamma (y)
 ve prune tree
 +ve no prune
 Calculate output values