Random Forest

Bias

The inability of a Machine Learning Model to capture the true relationship is known as bias.

  • Straight Line: High Bias for Training Data, Low Variance for Testing Data
  • Squiggly Line: Low Bias for Training Data, High Variance for Testing Data

  • The ideal model has a sweet spot where both bias and variance are low.
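
A minimal sketch of the idea above, assuming scikit-learn and NumPy (my own illustration, not part of the original notes): a straight line (degree-1 fit) has high training error (bias), while a very flexible "squiggly" fit (degree-15 polynomial) has low training error but much higher testing error (variance).

```python
# Contrast a straight line (high bias) with a squiggly line (high variance)
# on noisy data drawn from a sine curve.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):  # 1 = straight line, 15 = squiggly line
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))  # reflects bias
    test_err = mean_squared_error(y_te, model.predict(X_te))   # reflects variance
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```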

Cross-Validation

Cross-validation allows us to compare different Machine Learning methods and get a sense of how well they work.

If you divide the dataset into 4 parts and use the first 3 for training and the last one for testing, how will you know whether that is a good split?

The best way is to build a model for every combination of training and testing folds, and then summarise the results.
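
A minimal sketch, assuming scikit-learn and its built-in breast-cancer dataset (an illustration, not from the original notes): 4-fold cross-validation gives each block of the data one turn as the testing set, and the per-fold scores are summarised to compare two methods.

```python
# Compare two methods with 4-fold cross-validation and summarise the scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=0)  # 4 parts, each tested once

for name, model in [("logistic regression", LogisticRegression(max_iter=5000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)       # one accuracy per fold
    print(f"{name}: mean {scores.mean():.3f}, std {scores.std():.3f}")
```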

Cross-validation is also used for tuning, e.g. choosing the penalty in Ridge Regression.
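
A minimal sketch of that tuning step, assuming scikit-learn's RidgeCV and its diabetes dataset (an illustration, not from the notes): cross-validation picks the penalty strength (alpha) for Ridge Regression.

```python
# Use 4-fold cross-validation to choose the Ridge Regression penalty.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=4)  # candidate penalties
ridge.fit(X, y)
print("best alpha:", ridge.alpha_)
```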

Random Forest Steps

  • Bagging and Boosting
  • Bagging = Bootstrap + Aggregation
  • Create a bootstrapped dataset (see the bootstrap sketch after this list)

    • The same data point can appear more than once
    • About 1/3 of the data points remain unselected
    • It is the same size as the original dataset
  • Choose the root node using information gain

    • A number of candidate feature variables is chosen, say V = 2
    • Randomly select V features for the root node split
  • Build a tree, but only consider a random subset of variables (2 here) at each split.
  • Repeat the above steps and build hundreds of trees.
  • For prediction

    • Classification: voting (the majority output is chosen)
    • Regression: the average of all the trees' outputs
  • How do we know if the RF is good?

    • Remember, when creating the bootstrapped dataset we allowed duplicates
    • About 1/3 of the dataset does not end up in the bootstrapped dataset
    • These samples are known as the Out-of-Bag dataset
  • Pass each Out-of-Bag sample through all the trees that were built without it and calculate the Out-of-Bag error.
  • Now repeat the same process, increasing the number of features considered for node selection (see the Random Forest sketch after this list).

    • For example, the number of candidate variables can be squared: v1 = 2, v2 = 2^2 = 4
    • The trees are then built by choosing from 4 candidate features at each node split, and the same process is repeated.
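
A minimal sketch of the bootstrap step, assuming only NumPy (an illustration, not from the notes): sample row indices with replacement, same size as the original, and check that roughly 1/3 of the rows are never picked; those are the Out-of-Bag samples.

```python
# Build one bootstrapped index set and measure the Out-of-Bag fraction.
import numpy as np

rng = np.random.RandomState(0)
n = 1000                                      # rows in the "original" dataset
indices = rng.randint(0, n, size=n)           # sample with replacement, same size
in_bag = np.unique(indices)                   # rows that made it into the bootstrap
oob = np.setdiff1d(np.arange(n), in_bag)      # rows left out = Out-of-Bag
print(f"Out-of-Bag fraction: {len(oob) / n:.2f}")  # about 0.37, i.e. roughly 1/3
```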
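
And a minimal end-to-end sketch, assuming scikit-learn's RandomForestClassifier (an illustration, not from the notes): hundreds of bootstrapped trees, a random subset of features at each split, the Out-of-Bag error as the quality check, and a sweep over the number of candidate features per split.

```python
# Fit Random Forests with different feature-subset sizes and compare OOB error.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

for max_features in (2, 4, 8):                       # candidate features per split
    rf = RandomForestClassifier(n_estimators=500,    # hundreds of trees
                                max_features=max_features,
                                oob_score=True,      # score on Out-of-Bag samples
                                random_state=0)
    rf.fit(X, y)
    print(f"max_features={max_features}: OOB error {1 - rf.oob_score_:.3f}")
```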

tl;dr:

Bagging and random forests are “bagging” algorithms that aim to reduce the complexity of models that overfit the training data.

In contrast, boosting is an approach to increase the complexity of models that suffer from high bias, that is, models that underfit the training data.
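
A minimal sketch of the tl;dr, assuming scikit-learn (an illustration, not from the notes): bagging averages many deep, high-variance trees, while boosting adds up many shallow, high-bias trees (stumps).

```python
# Bagging of deep trees (variance reduction) vs. boosting of stumps (bias reduction).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=100, random_state=0)  # deep trees by default
boosting = GradientBoostingClassifier(max_depth=1,             # depth-1 stumps
                                      n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(f"{name}: {cross_val_score(model, X, y, cv=4).mean():.3f}")
```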