THEORETICAL UNDERSTANDING

  • Bagging Algorithm

    • Bagging is known as a parallel learning process: each individual decision tree is built independently of the others.
    • In the bagging technique, multiple decision trees are built and their results are combined to give the final output. In classification the output is the majority vote; in regression the output is the average of all the tree predictions (a short scikit-learn sketch follows this list).
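
A minimal scikit-learn sketch of bagging decision trees; the dataset, estimator count, and random seeds below are illustrative assumptions, not taken from these notes:

```python
# Illustrative bagging of decision trees with scikit-learn.
# BaggingClassifier's default base learner is a decision tree, and the
# bagged trees are independent of one another, so they can be fit in parallel.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagger = BaggingClassifier(
    n_estimators=50,   # number of independently built trees
    n_jobs=-1,         # fit the trees in parallel
    random_state=0,
).fit(X, y)

# Classification output is the majority vote of the 50 trees.
print("training accuracy:", bagger.score(X, y))
```
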
  • How Random Forest Works?

    • Suppose we have a dataset of dimension M x N. We first build random bags using Bootstrap Aggregation.
    • These bags are created using row sampling and column sampling with replacement (a sampled row is put back, so the same row can be picked again). Each bag contains fewer rows and columns than the original M x N dataset.
    • A decision tree is then built on each bag and used to predict the output.
    • The final prediction is made by majority vote in the case of classification and by averaging all the tree results in the case of regression (see the from-scratch sketch after this list).
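
A rough from-scratch sketch of these steps (bag creation by row/column sampling, one tree per bag, majority vote); NumPy, scikit-learn, and all parameter values here are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for an M x N dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_trees = 25
rng = np.random.default_rng(0)
trees, feature_sets = [], []
for _ in range(n_trees):
    # Build one "bag": row sampling with replacement (bootstrap) and
    # column sampling (done here without replacement for simplicity).
    rows = rng.choice(len(X), size=len(X), replace=True)
    cols = rng.choice(X.shape[1], size=int(np.sqrt(X.shape[1])), replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

# Final prediction: majority vote across all trees (binary labels 0/1 here).
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy:", (y_pred == y).mean())
```
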
  • What is Bootstrap Aggregation?

    • Bootstrap Aggregation is a general procedure used to reduce the variance of algorithms that have high variance. Decision trees, such as classification and regression trees (CART), are a typical example of high-variance algorithms.
    • Bagging leverages a bootstrap sampling technique to create diverse samples.
    • This resampling method generates different subsets of the training dataset by selecting data points at random and with replacement.
    • This means that when you select data points from the training dataset, you can select the same instance multiple times; as a result, a value/instance may appear twice (or more) in a sample. Averaging over many such diverse trees converts the high variance of a single decision tree into low variance, which is the greatest benefit of the Random Forest algorithm (a tiny resampling demo follows this list).
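
A tiny NumPy demonstration of sampling with replacement (values and seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)                               # stand-in for 10 training rows
bag = rng.choice(data, size=10, replace=True)      # bootstrap sample (with replacement)

print("bootstrap sample:", bag)                    # some rows appear more than once
print("unique rows kept:", np.unique(bag).size, "out of", data.size)
```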

INTERVIEW TOPICS

Important properties of Random Forest Classifiers

  1. Decision Tree: Low Bias and High Variance
  2. Ensemble Bagging (Random Forest): Low Bias and Low Variance

1. What Are the Basic Assumptions?

  • There are no such assumptions.

2. Advantages of Random Forest

  1. Much less prone to overfitting than a single decision tree

  2. A favourite algorithm for Kaggle competitions

  3. Less Parameter Tuning required

  4. Can handle both continuous and categorical variables, since it is built on decision trees.

  5. No feature scaling required: feature scaling (standardization or normalization) is not needed for Random Forest because it uses decision trees internally.

  6. Suitable for a wide range of ML problems (both classification and regression)

3. Disadvantages of Random Forest

  1. Biased towards features having many categories

  2. Biased in multiclass classification problems towards more frequent classes.

  3. Black-box model: it is hard to explain to others and difficult to see what is happening inside.

  4. When there is a large number of features and rows in the dataset, the model becomes very slow and expensive to train (time complexity and model complexity increase).

4. Whether Feature Scaling is required?

  • No. Random Forest is a rule-based (tree-splitting) algorithm that involves no distance measurement, so feature scaling is not required (a quick check follows).
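
A quick, assumption-based check that standardizing the features does not change a Random Forest's predictions (synthetic data; same random seed for both models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same random seed: one model on raw features, one on standardized features.
rf_raw = RandomForestClassifier(random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(random_state=0).fit(X_scaled, y)

# Tree splits depend only on the ordering of values, so this is expected to print True.
print(np.array_equal(rf_raw.predict(X), rf_scaled.predict(X_scaled)))
```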

5. Impact of outliers?

Robust to Outliers

6. Difference between Gradient Boosting algorithm and Random forest algorithm?

  • GB is more prone to overfitting when the data is noisy, whereas this is much less of a problem for the RF algorithm.
  • GB takes longer to train since its decision trees are built sequentially; in Random Forest the trees are built independently and can be trained in parallel (see the comparison sketch after this list).
  • The GB algorithm is harder to tune.
  • GB combines weak learners to obtain a strong prediction, while Random Forest combines its trees by majority voting (classification) or averaging (regression).
  • Random Forest is more prone to bias, since bagging mainly reduces variance while boosting mainly reduces bias.
  • RF does not use a sequential approach and does not handle imbalanced datasets well.
  • RF utilizes fully grown decision trees.
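
A short scikit-learn comparison of the two ensembles on a synthetic dataset; the dataset and hyperparameters are illustrative assumptions, not a definitive benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Forest: independent, fully grown trees aggregated by majority vote.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Gradient Boosting: shallow trees built sequentially, each correcting the previous ones.
gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("RF test accuracy:", rf.score(X_te, y_te))
print("GB test accuracy:", gb.score(X_te, y_te))
```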

Types of Problems It Can Solve (Supervised)

  1. Classification
  2. Regression

Performance Metrics

Classification

  1. Confusion Matrix
  2. Precision, Recall, F1 Score

Regression

  1. R2, Adjusted R2
  2. MSE, RMSE, MAE (a short sketch computing these metrics follows)
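
A minimal sketch computing the listed metrics with scikit-learn; the toy label and prediction arrays are made up for illustration:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, r2_score, mean_squared_error,
                             mean_absolute_error)

# Classification metrics on toy predictions.
y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression metrics on toy predictions (RMSE is the square root of MSE;
# adjusted R2 is derived from R2, the sample size, and the feature count).
y_true_r, y_pred_r = np.array([3.0, 5.0, 2.5]), np.array([2.8, 5.3, 2.9])
mse = mean_squared_error(y_true_r, y_pred_r)
print(r2_score(y_true_r, y_pred_r), mse, np.sqrt(mse), mean_absolute_error(y_true_r, y_pred_r))
```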