THEORETICAL UNDERSTANDING

  • Why do we need an activation function in a neural network?

    • We need an activation function in a neural network for the following reasons:
      • To bring non-linearity into a neural network.
        • A non-linear function is one whose graph is not a straight line, i.e. a change in the input is not proportional to the change in the output. Without this non-linearity, a stack of layers collapses into a single linear transformation, which is the main reason we need an activation function (see the sketch after this list).
      • To bring differentiability into a neural network.
        • During backpropagation, the gradient of the loss function is computed via the chain rule and then used by gradient descent to update the weights. As a result, the activation function must be differentiable with respect to its input.
      • To bring the continuity property into a neural network.
        • A continuous function is one in which a continuous variation of the argument causes a continuous variation of the function’s value, which is essential in the case of a neural network.
      • To bring the bounded property into a neural network.
        • A bounded function’s range has both a lower and an upper bound. This is relevant for neural networks because the activation function keeps the output values within a certain range; otherwise, the values may grow beyond reasonable limits.
      • To bring the zero-centering property into a neural network.
        • When a function’s range contains both positive and negative values, it is said to be zero-centered. If a function is not zero-centered, like the sigmoid function, the output of a layer is always shifted towards positive or negative values. As a result, the weight matrix requires more updates to be adequately trained, increasing the number of epochs needed to train the network. This is why the zero-centered property is useful, even if it is not required.
      • From the following picture we can understand the importance of an activation function in a neural network: act1

      • An activation function with a low computational cost takes less time to train a neural network, while a costly one takes more. Hence, the search for functions with a lower computational cost has been an ongoing effort in the machine learning research field.
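      • As a minimal illustration of the non-linearity point above, the NumPy sketch below (the layer sizes and random weights are purely illustrative, not from the original text) shows that two stacked linear layers without an activation function collapse into a single linear layer, while inserting a ReLU in between breaks that equivalence.

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=(5, 3))                # a small batch of 5 inputs with 3 features

        # Two "layers" with no activation function in between.
        W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
        two_linear_layers = (x @ W1) @ W2

        # The same mapping collapses into a single linear layer with weights W1 @ W2.
        one_linear_layer = x @ (W1 @ W2)
        print(np.allclose(two_linear_layers, one_linear_layer))   # True -> no extra expressive power

        # Inserting a non-linearity (here ReLU) between the layers breaks this equivalence.
        with_activation = np.maximum(x @ W1, 0.0) @ W2
        print(np.allclose(with_activation, one_linear_layer))     # False
        ```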
  • Types of Activation Functions

    • Sigmoid

      • The sigmoid function, often called the logistic sigmoid function, is one of the most commonly used functions in feedforward neural networks today. This is primarily due to its non-linearity and the simplicity of its derivative, which is relatively inexpensive to compute. It is given as:

        act2

        act3

        Photo Credit: Machine Learning Mastery

      • The range of the sigmoid activation function is between 0 and 1.
      • The derivative of the sigmoid function ranges from 0 to 0.25.
      • The sigmoid function squashes a large range of inputs into the small range between 0 and 1. Therefore, a large change in the input value leads to a small change in the output value, resulting in small gradient values as well. Because of these small gradient values, it is prone to suffering from the vanishing gradient problem.
      • The main problem caused by the sigmoid function is the vanishing gradient problem.
        • Vanishing Gradient Problem

          • With the sigmoid activation function, backpropagation uses the derivative of the sigmoid, whose value ranges from 0 to 0.25. The chain rule multiplies one such derivative per layer to obtain the gradient of the loss function, so that gradient becomes very small in the early layers.
          • When updating a weight, we multiply this gradient of the loss function by the learning rate. Since the learning rate is itself small, the resulting update is tiny.
          • As a result, the updated weight is almost identical to the initial weight, which is not what we want. This problem is known as the vanishing gradient problem.
      • To solve this problem, different activation functions are used, as described in the sections below; the sketch below illustrates the effect numerically.
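      • A minimal NumPy sketch of the numbers involved (the depth of 10 layers is an illustrative assumption, not from the original text): the sigmoid output stays between 0 and 1, its derivative never exceeds 0.25, and multiplying one such derivative per layer makes the gradient vanish.

        ```python
        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def sigmoid_derivative(x):
            s = sigmoid(x)
            return s * (1.0 - s)                      # maximum value is 0.25, reached at x = 0

        x = np.linspace(-6, 6, 1001)
        print(sigmoid(x).min(), sigmoid(x).max())     # outputs stay between 0 and 1
        print(sigmoid_derivative(x).max())            # ~0.25

        # Backpropagation multiplies one such derivative per layer (chain rule), so the
        # gradient reaching the early layers shrinks at best like 0.25 ** depth.
        depth = 10
        print(0.25 ** depth)                          # ~9.5e-07: the gradient has vanished
        ```
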
  • TanH

    • The hyperbolic tangent, or tanh, function became more popular than the sigmoid function because, in most cases, it gives better training performance for multi-layer neural networks. The tanh function inherits all the valuable properties of the sigmoid function. It is defined as:

      act4

      • Graph:

      act5

      Photo Credit

    • The tanh function is continuous, differentiable and bounded, and ranges between -1 and 1.
    • The derivative of the tanh function ranges from 0 to 1.
    • The tanh function was created to combine the advantages of the sigmoid function with a zero-centered output, so that the vanishing gradient problem is reduced.
    • The range of possible outputs is expanded to include negative, positive, and zero values.
    • Moreover, the tanh function is zero-centered, hence reducing the number of epochs needed to train the network compared to the sigmoid function. The zero-centered property is one of the main advantages of the tanh function, and it helps the backpropagation process.
    • The tanh function, in a similar way to the sigmoid, squashes a large range of inputs into the small range between -1 and 1. Thus, a large change in the input value leads to a small change in the output value, which results in gradient values close to zero. Because the gradient values can get close to zero, tanh also suffers from the vanishing gradient problem (see the sketch after this list). This problem prompted more research into activation functions, which led to the development of the ReLU activation function.
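    • A minimal NumPy sketch of these tanh properties (the input grid is an illustrative assumption): the output stays between -1 and 1, the derivative peaks at 1 rather than 0.25, and the mean output over symmetric inputs is roughly zero, unlike the sigmoid.

      ```python
      import numpy as np

      def tanh_derivative(x):
          return 1.0 - np.tanh(x) ** 2              # ranges from 0 to 1 (maximum at x = 0)

      x = np.linspace(-6, 6, 1001)
      print(np.tanh(x).min(), np.tanh(x).max())     # outputs stay between -1 and 1
      print(tanh_derivative(x).max())               # 1.0, larger than sigmoid's 0.25

      # Zero-centering: for inputs symmetric around 0 the mean tanh output is ~0,
      # whereas the sigmoid's mean output is ~0.5 (always positive).
      print(np.tanh(x).mean())                      # ~0.0
      print((1.0 / (1.0 + np.exp(-x))).mean())      # ~0.5
      ```
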
  • ReLU

    • Since its proposal [6], the rectified linear unit (ReLU) function has been widely used in neural networks because of its efficient properties. The ReLU function is defined as:

    act6

    • Where x is the input to the activation function. The ReLU function is continuous, not bounded above, and not zero-centered.

    • Unlike the sigmoid and tanh functions, ReLU involves no exponential: it simply forces negative values to zero, so it has a low computational cost. This makes the ReLU function a better candidate for neural networks, as it provides better performance and generalization compared to the sigmoid and tanh functions.
    • Negative inputs passed to the ReLU function are mapped to zero. As a consequence, negatively weighted neurons receive zero gradient, stop contributing to the network, and are never updated, which is known as the dead neuron problem (illustrated in the sketch after this list).
    • A new variant of the ReLU, called LReLU, was introduced in an attempt to solve the dead neuron problem.
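    • A minimal NumPy sketch of ReLU and its gradient (the sample inputs are illustrative): negative inputs are clamped to zero and receive zero gradient, which is what leaves negatively weighted neurons "dead".

      ```python
      import numpy as np

      def relu(x):
          return np.maximum(0.0, x)                 # f(x) = max(0, x)

      def relu_derivative(x):
          return (x > 0).astype(float)              # 1 for x > 0, otherwise 0

      x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
      print(relu(x))              # [0.  0.  0.  0.5 3. ]   negative inputs become zero
      print(relu_derivative(x))   # [0.  0.  0.  1.  1. ]   zero gradient -> "dead" neurons never update
      ```
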
  • LReLU

    • The leaky ReLU (LReLU) function is continuous, not bounded, zero-centered, and has a low computational cost. The LReLU is defined as:

      act7

    • Unlike the ReLU, the LReLU function allows negative inputs to produce small (negative) outputs instead of zero.
    • For input values smaller than 0 the gradient is 0.01 (at x = 0 the left-hand derivative is 0.01 and the right-hand derivative is 1). For input values greater than 0 the gradient is always 1, so the positive side does not suffer from the vanishing gradient problem. On the negative side, however, the gradient is always 0.01, which still leaves a risk of suffering from the vanishing gradient problem (see the sketch below).
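    • A minimal NumPy sketch of LReLU (the 0.01 slope follows the definition above; the sample inputs are illustrative): negative inputs produce small negative outputs and a gradient of 0.01 instead of 0.

      ```python
      import numpy as np

      def leaky_relu(x, slope=0.01):
          return np.where(x > 0, x, slope * x)      # small non-zero output for negative inputs

      def leaky_relu_derivative(x, slope=0.01):
          return np.where(x > 0, 1.0, slope)        # gradient is 1 for x > 0 and 0.01 otherwise

      x = np.array([-3.0, -0.5, 0.5, 3.0])
      print(leaky_relu(x))              # [-0.03  -0.005  0.5   3.  ]
      print(leaky_relu_derivative(x))   # [ 0.01   0.01   1.    1.  ]
      ```
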
  • PReLU

    • To solve the vanishing gradient problem faced by the leaky ReLU function, the parametric ReLU (PReLU) function was introduced. The PReLU function is defined as:

      act8

    • Where a is a learnable parameter and x is the input to the activation function. When a = 0.01 the PReLU function is equal to LReLU, and when a = 0 it is equal to ReLU. As a result, PReLU can be viewed as a general form of the rectifier nonlinearities.
    • The PReLU is not bounded, continuous, and zero-centered. When x is less than zero, the function’s gradient is a, and when x is greater than zero, the function’s gradient is 1.
    • There is no vanishing gradient problem in the positive part of the PReLU function, where the gradient is always 1. However, the gradient on the negative side is always a, which may be learned to be close to zero, so the possibility of a vanishing gradient problem remains (see the sketch below).
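    • A minimal PyTorch sketch of PReLU with its learnable slope a (note that PyTorch's built-in nn.PReLU initialises a to 0.25 by default rather than 0.01; the inputs here are illustrative): a receives a gradient during backpropagation just like any other weight.

      ```python
      import torch
      import torch.nn as nn

      # PReLU's negative slope `a` is a learnable parameter (initialised here to 0.25).
      prelu = nn.PReLU(num_parameters=1, init=0.25)

      x = torch.tensor([-2.0, -0.5, 0.5, 2.0])
      y = prelu(x)
      print(y)                    # negative inputs scaled by a, positive inputs passed through

      # Since `a` is a parameter, it gets a gradient and is updated during training.
      y.sum().backward()
      print(prelu.weight.grad)    # d(sum of outputs)/da = sum of the negative inputs = -2.5
      ```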