THEORETICAL UNDERSTANDING

  • In layman's terms, it is like planning a trip to places you know nothing about, so you ask your friends to suggest destinations they know. Suppose they suggest place X and place Y. You now apply a condition to choose between the two, and based on that condition you pick one place or the other. This is essentially how a decision tree works: it asks a series of conditions and follows the branch that matches.

    How does a Decision Tree work?

  • For every candidate split of every feature, we first calculate the entropy or Gini impurity of the resulting child nodes.
  • We then choose the split (feature and threshold) whose children have the lowest entropy or Gini value, i.e. the split that gives the highest information gain.
  • After splitting on the best feature, we repeat the same procedure on each child node and keep splitting until a stopping condition is reached (for example, the node is pure or a depth limit is hit). A minimal end-to-end example is sketched below.
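
As a minimal end-to-end sketch of this procedure, here is how it looks with scikit-learn's DecisionTreeClassifier (the Iris dataset and the chosen criterion are only illustrative):

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The `criterion` parameter selects the impurity measure ("gini" or "entropy")
# used to evaluate candidate splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```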

The obvious question, then: what are entropy, Gini impurity, and information gain?

  • Entropy

    • It measures the randomness in the information being processed; in other words, it measures the impurity of a split. The higher the entropy, the harder it is to draw any conclusion from that information.
    • Entropy is nothing but the measure of disorder.
      • The entropy of a set S is calculated as:
      • $Entropy(S) = -\sum_{i} p_i \log_2(p_i)$
    • Here ‘pi’ is simply the probability of an element/class ‘i’ in our data. For simplicity, suppose we have only two classes, a positive class and a negative class, so ‘i’ is either (+) or (−). If we had a total of 100 data points, with 30 belonging to the positive class and 70 to the negative class, then ‘P+’ would be 3/10 = 0.3 and ‘P−’ would be 7/10 = 0.7. Pretty straightforward.
    • For a two-class problem, entropy ranges from 0 (a completely pure node) to 1 (a perfectly mixed, 50/50 node).
  • GINI Impurity

    • It measures the likelihood that a new instance would be incorrectly classified if it were randomly labelled according to the distribution of classes in the node.
    • It helps us to determine which splitter is best so that we can build a pure decision tree.
    • For a two-class problem, Gini impurity ranges from 0 (a pure node) to 0.5 (a 50/50 node).
      • $Gini = 1 - \big(P(\text{yes})^2 + P(\text{no})^2\big)$
      • Here P(yes) is the probability of the positive class and P(no) is the probability of the negative class.
  • Information Gain

    • Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding the attribute whose split returns the highest information gain (equivalently, the largest reduction in entropy).
      • It is calculated as follows:
      • $Gain(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|}\, Entropy(S_v)$
      • Here S is the full set of examples at the parent node (the total yes/no counts in our tree) and Sv is the subset of examples falling into each child of the split (split B and split C in our case). Entropy(S) and Entropy(Sv) are the entropies of the parent and of each child, respectively. A small code sketch covering entropy, Gini impurity, and information gain follows this list.
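
A small sketch of the three measures just defined, using the toy counts from the examples above (the 30/70 and split-B/split-C numbers are purely illustrative):

```python
import numpy as np

def entropy(counts):
    """Entropy(S) = -sum(p_i * log2(p_i)) for the class counts at a node."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini = 1 - sum(p_i^2) for the class counts at a node."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_counts, child_counts_list):
    """IG = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over the child nodes."""
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

print(entropy([30, 70]))                             # ~0.881 for the 30/70 example
print(gini([30, 70]))                                # 0.42
# Hypothetical parent with 9 yes / 5 no, split into children [6, 2] and [3, 3]:
print(information_gain([9, 5], [[6, 2], [3, 3]]))    # ~0.048
```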

Difference between Entropy and Gini Impurity.

  • [Figure: entropy and Gini impurity plotted against P(yes).]
    • Plotting both measures against P(yes) shows that entropy reaches a maximum of 1 while Gini impurity reaches a maximum of 0.5, both at P(yes) = 0.5 (the sketch below reproduces this).
    • Choosing Gini impurity over entropy therefore reduces the computational cost of the algorithm, because we do not have to compute the logarithm of the probabilities.
    • This is the main reason Gini impurity is used as the default in most state-of-the-art implementations.
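
A quick numeric sketch of this comparison, evaluating both measures over a grid of P(yes) values for a two-class node:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)                 # P(yes) for a two-class node
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p ** 2 + (1 - p) ** 2)

# Both measures peak when the classes are perfectly mixed (P(yes) = 0.5).
print("max entropy:", entropy.max())            # 1.0
print("max gini   :", gini.max())               # 0.5
```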

INTERVIEW TOPICS

Advantages of Decision Tree

  1. Clear Visualization: The algorithm is simple to understand, interpret, and visualize, since it mirrors the way we make decisions in daily life. The output of a Decision Tree can be easily interpreted by humans.

  2. Simple and easy to understand: Decision Tree looks like simple if-else statements which are very easy to understand.

  3. Decision Tree can be used for both classification and regression problems.

  4. Decision Tree can handle both continuous and categorical variables.

  5. No feature scaling required: No feature scaling (standardization or normalization) is required for a Decision Tree, as it uses a rule-based approach rather than distance calculations.

  6. Handles non-linear relationships efficiently: Non-linear relationships between the features and the target don't hurt a Decision Tree the way they hurt curve-based (e.g. linear) algorithms. So, when there is high non-linearity, Decision Trees may outperform curve-based algorithms (a small comparison is sketched after this list).

  7. Some Decision Tree implementations (e.g. C4.5) can handle missing values automatically.

  8. Decision Tree is usually robust to outliers and can handle them automatically.

  9. Less training time: Training time is lower compared to Random Forest, because it builds only a single tree rather than a whole forest of trees.
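
Regarding point 6, a small sketch comparing a Decision Tree with a linear model on deliberately non-linear data; the dataset (scikit-learn's make_moons), noise level, and seed are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two interleaving half-moons: a clearly non-linear decision boundary.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# The tree typically scores noticeably higher here, since it can carve out
# axis-aligned regions instead of relying on a single linear boundary.
print("logistic regression:", linear.score(X_te, y_te))
print("decision tree      :", tree.score(X_te, y_te))
```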

Disadvantages of Decision Tree

  1. Overfitting: This is the main problem with Decision Trees. They tend to overfit the data, which ultimately leads to wrong predictions on new data. In order to fit the data (even noisy data), the tree keeps generating new nodes until it becomes too complex to interpret; in this way it loses its generalization capability. It performs very well on the training data but starts making many mistakes on unseen data (a short demonstration follows this list).

  2. High variance: As mentioned in point 1, Decision Trees tend to overfit. Because of this overfitting there is a high chance of high variance in the output, which leads to many errors in the final estimate and poor accuracy on new data. In driving the training error (bias) towards zero, the tree ends up with high variance.

  3. Unstable: Adding a new data point can change the best splits, so the whole tree may need to be regenerated and all nodes recalculated and recreated.

  4. Not suitable for large datasets: If data size is large, then one single tree may grow complex and lead to overfitting. So in this case, we should use Random Forest instead of a single Decision Tree.
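
A short demonstration of point 1 (the dataset and seed are arbitrary): an unrestricted tree memorizes the training data perfectly but does worse on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown tree: keeps splitting until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("train accuracy:", full.score(X_tr, y_tr))   # 1.0 -> the training data is memorized
print("test  accuracy:", full.score(X_te, y_te))   # noticeably lower -> overfitting
```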

Whether Feature Scaling is required?

Feature scaling is only required when a distance calculation is involved. A Decision Tree uses a rule-based approach (it compares each feature against a threshold), so feature scaling is not required. The sketch below demonstrates this.
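
A small sketch of this point (scikit-learn's breast_cancer dataset is used purely as an example): fitting the same tree on raw and on standardized features gives the same predictions, because rescaling only shifts the split thresholds.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Rescaling changes the threshold values but not which samples go left/right,
# so the two trees should make identical predictions.
print(np.array_equal(raw.predict(X), scaled.predict(X_scaled)))   # expected: True
```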

Impact of outliers?

  • It is not very sensitive to outliers. Since extreme values (outliers) rarely cause much reduction in the splitting criterion (e.g. RSS for regression trees), they rarely determine where a split is made. Hence, tree-based methods are largely insensitive to outliers.

  • Outliers do not impact a Decision Tree much, as it does not use any distance metric to classify the output.

When you have both numerical and categorical features in your dataset, how is the decision tree constructed?

  • For Numerical Features

    • First all the numerical values are sorted in ascending order.
    • Then candidate thresholds are evaluated for the feature and the decision tree is built on the best one. For example, if we have a dataset with two features X and Y, each containing five numerical values ranging from 1 to 5, we might set the threshold as "value greater than 3".
    • Values of 3 or less then go into one branch and values greater than 3 into the other; in this way we get a tree for numerical features (see the threshold-search sketch after this list).
  • For Categorical Features

    • We split on the category values (for example a {yes/no} split) and calculate the total Gini impurity and the information gain for each candidate split.
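
A sketch of the threshold search for a single numerical feature, following the steps above (sort the values, try the midpoints between consecutive values, keep the threshold with the lowest weighted Gini impurity); the toy feature and labels are made up for illustration:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted Gini) for the best split of one numeric feature."""
    order = np.argsort(feature)
    x, y = feature[order], labels[order]
    best_t, best_impurity = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                               # no threshold between equal values
        t = (x[i] + x[i - 1]) / 2                  # midpoint between consecutive values
        left, right = y[:i], y[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity

# Toy feature with five values from 1 to 5, as in the description above.
x = np.array([1, 2, 3, 4, 5])
y = np.array([0, 0, 0, 1, 1])
print(best_threshold(x, y))   # (3.5, 0.0): "value > 3" separates the classes perfectly
```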

How is overfitting handled in a decision tree?

  • One way is using pruning.

    • Pruning is the process of removing nodes that are not needed to reach the desired accuracy.
    • As an analogy, pruning a plant means removing selected parts such as buds, branches, and roots. In a Decision Tree, pruning does the same job: it removes branches of the tree to overcome overfitting. This can be done in two ways; we discuss both techniques below.
    • Pre-pruning

      • Pre-pruning restricts the tree while it is being grown, stopping splits before the tree becomes too complex.
      • Pre-pruning is done through hyperparameter tuning (e.g. max_depth, min_samples_split, min_samples_leaf); see the sketch after this list.
      • It helps overcome the overfitting issue.
    • Post-pruning

      • Post-pruning is the process of pruning the tree after it has been fully trained.
      • Post-pruning is used when the decision tree has grown to a very large depth and is overfitting.
      • It is also known as backward pruning.
      • This technique is applied to a fully grown (unrestricted) decision tree.
  • Another way is to use a Random Forest (an ensemble of trees, which averages away much of a single tree's variance).
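
A sketch of both pruning styles using scikit-learn (the dataset, the hyperparameter values, and the choice of ccp_alpha are arbitrary and only illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree while it is being grown.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post.fit(X_tr, y_tr)

print("pre-pruned  depth:", pre.get_depth(), "test acc:", pre.score(X_te, y_te))
print("post-pruned depth:", post.get_depth(), "test acc:", post.score(X_te, y_te))
```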

Types of Problems it can solve (Supervised)

Classification

  • Already discussed above.

Regression

  • Let’s take an example where we have a dataset containing two features, X and Y. We consider candidate conditions to split the tree, such as X > 2 or Y < 5. Examples satisfying the condition go to the left branch and the rest go to the right.

  • To split a node in a regression tree, we calculate the variance of the target values in each candidate child node.
  • $Var = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$
  • Then we calculate the variance reduction of every candidate split, which is defined as:
  • $Var\text{-}Red = Var(\text{parent}) - \sum_{i} w_i \, Var(\text{child}_i)$
  • Here wi (the weight) is the number of examples in child i divided by the number of examples in the parent node.
  • The split with the larger variance reduction is selected; in the original example, Var-Red1 is greater than Var-Red2, so the first split is chosen.
  • This step is then repeated for every node (a small sketch follows this list).
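
A sketch of the variance-reduction calculation for two candidate splits of the same node (the target values below are made up for illustration):

```python
import numpy as np

def variance_reduction(parent, children):
    """Var-Red = Var(parent) - sum(w_i * Var(child_i)), w_i = |child_i| / |parent|."""
    n = len(parent)
    weighted_child_var = sum(len(c) / n * np.var(c) for c in children)
    return np.var(parent) - weighted_child_var

# Hypothetical target values at a node, and two candidate ways of splitting them.
y_parent = np.array([2.0, 3.0, 10.0, 11.0, 12.0])
split_1 = [np.array([2.0, 3.0]), np.array([10.0, 11.0, 12.0])]   # clean separation
split_2 = [np.array([2.0, 10.0, 12.0]), np.array([3.0, 11.0])]   # mixed split

print(variance_reduction(y_parent, split_1))   # larger reduction -> this split is chosen
print(variance_reduction(y_parent, split_2))   # smaller reduction
```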

Performance Metrics

Classification

  1. Confusion Matrix
  2. Precision, Recall, F1 score

Regression

  1. R2, Adjusted R2
  2. MSE, RMSE, MAE
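
A short sketch computing the metrics above with scikit-learn (the label and prediction arrays are toy values; adjusted R2 is not built into scikit-learn, so it is derived from R2 assuming a single feature):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score,
                             r2_score, mean_squared_error, mean_absolute_error)

# Classification metrics on toy labels/predictions.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression metrics on toy targets/predictions.
yt = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
yp = np.array([2.5, 5.0, 3.0, 8.0, 4.5])
r2 = r2_score(yt, yp)
n, p = len(yt), 1                                    # p = number of features (assumed 1 here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # adjusted R2 computed by hand
mse = mean_squared_error(yt, yp)
print(r2, adj_r2)
print(mse, np.sqrt(mse), mean_absolute_error(yt, yp))   # MSE, RMSE, MAE
```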