Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 2)

Author: Murphy

Welcome back to a new chapter of "Courage to Learn ML." For those new to the series, it aims to make these complex topics accessible and engaging, much like a casual conversation between a mentor and a learner, inspired by the writing style of "The Courage to Be Disliked," with a specific focus on machine learning.

This time we will continue our exploration of how to overcome the challenges of vanishing and exploding gradients. In the opening segment, we talked about why it's critical to maintain stable gradients to ensure effective learning within our networks. We uncovered how unstable gradients can be barriers to deepening our networks, essentially putting a cap on the potential of deep "learning." To bring these concepts to life, we used the analogy of running a miniature ice cream factory named DNN (short for Delicious Nutritious Nibbles) and drew parallels that illuminate potent strategies for DNN training, akin to orchestrating a seamless factory production line.

Now, in this second installment, we're diving deeper into each proposed solution, examining them with the same clarity and creativity that brought our ice cream factory to life. Here is the list of topics we'll cover in this part:

  1. Activation Functions
  2. Weight Initialization
  3. Batch Normalization
  4. In Practice (Personal Experience)

Activation Functions

Activation functions are the backbone of our "factory" setup. They're responsible for passing along information in both forward and backward propagation within our DNN assembly line. Picking the right ones is crucial for the smooth operation of that assembly line and, by extension, our DNN training process. This part isn't just a simple rundown of activation functions along with their advantages and disadvantages. Here, I will use the Q&A format to uncover the deeper reasoning behind the creation of different activation functions and to answer some important questions that are often overlooked.

Think of these functions as the blenders in our ice cream production analogy. Rather than offering a catalog of available blenders, I'm here to provide an in-depth review and understand the innovations of each and the reasons behind any specific enhancements.

What exactly are activation functions, and how do I choose the right one?

Image created by the author using ChatGPT.

Activation functions are the key elements that grant a neural network the flexibility and power to capture both linear and nonlinear relationships. The key distinction between logistic regression and DNNs lies in these activation functions combined with multiple layers; together they allow NNs to approximate a wide range of functions. However, this power comes with its challenges. The choice of activation function deserves careful consideration, because the wrong selection can stop the model from learning effectively, especially during backpropagation.

Picture yourself as the manager of our DNN ice cream factory. You'd want to meticulously select the right activation function (think of them as ice cream blenders) for your production line. This means doing your homework and sourcing the best fit for your needs.

So, the first step in choosing an effective activation function involves addressing two key questions:

How does the choice of activation function affect issues like vanishing and exploding gradients? What criteria define a good activation function?

Note: since we are dealing with unstable gradients, our discussion focuses on the activations in the hidden layers. For the output activation function, the choice depends on the task: whether it's a regression or classification problem, and, for classification, whether it's binary or multiclass (typical pairings are sketched below).
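As a quick illustration, here is a minimal PyTorch sketch of the usual output-layer pairings; the layer sizes (64 hidden units, 10 classes) are just placeholders, not anything from the original article.

```python
import torch.nn as nn

# Typical output-layer pairings (hidden-layer activations are a separate question):
# - Regression: identity output (no activation), usually paired with nn.MSELoss.
# - Binary classification: sigmoid output, or raw logits with nn.BCEWithLogitsLoss.
# - Multiclass classification: softmax output, or raw logits with nn.CrossEntropyLoss
#   (which applies log-softmax internally).
regression_head = nn.Linear(64, 1)                               # identity output
binary_head     = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
multiclass_head = nn.Sequential(nn.Linear(64, 10), nn.Softmax(dim=-1))
```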

When choosing the activation function for hidden layers, the problem is mostly one of vanishing gradients. This can be traced back to the traditional sigmoid activation function, the most basic of our "blenders." The sigmoid was widely used due to its ability to map inputs to the range (0, 1), which can be read as a probability and is particularly useful in binary classification tasks. This capability allowed researchers to adjust the probability threshold for categorizing predictions, enhancing model flexibility and performance.

However, its application in hidden layers has led to significant challenges, most notably the vanishing gradient problem. This can be attributed to two main factors:

  • During the forward pass, the sigmoid function compresses inputs to a narrow range between 0 and 1. If a network uses only sigmoid as the activation function in its hidden layers, repeated application through multiple layers narrows this range further. This compression effect not only reduces the variability of outputs but also introduces a bias towards positive values, since outputs remain between 0 and 1 regardless of the input's sign.
  • During backpropagation, the derivative of the sigmoid function (which has a bell-shaped curve) yields values between 0 and 0.25. This small range causes gradients to diminish rapidly, regardless of the input, as they propagate through multiple layers, resulting in vanishing gradients. Since earlier-layer gradients are products of successive layer derivatives, this compounded product of small derivatives results in exponentially smaller gradients, preventing effective learning in earlier layers (see the sketch after this list).
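To make the second point concrete, here is a small PyTorch sketch (my own, not from the original article) that checks the 0.25 ceiling on the sigmoid's derivative and shows how quickly a chain of such factors shrinks with depth:

```python
import torch

def sigmoid_grad(x):
    s = torch.sigmoid(x)
    return s * (1 - s)          # peaks at 0.25 when x = 0

x = torch.linspace(-5, 5, 101)
print(sigmoid_grad(x).max())    # tensor(0.2500)

# Chaining n sigmoid layers multiplies in at most 0.25 per layer (ignoring the weights),
# so the upper bound on the gradient signal shrinks exponentially with depth.
for n in (5, 10, 20):
    print(n, 0.25 ** n)         # 0.25 ** 20 is roughly 9e-13
```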

To overcome these limitations, an ideal activation function should exhibit the following properties:

  • Non-linearity. Allowing the network to capture complex patterns.
  • Non-saturation. The function and its derivative should not compress the input range excessively, preventing vanishing gradients.
  • Zero-centered Output. The function should allow for both positive and negative outputs, ensuring that the mean output across the nodes does not introduce bias towards any direction.
  • Computational Efficiency. Both the function and its derivative should be computationally simple to facilitate efficient learning.

Given these essential properties, how do popular activation functions build upon our basic model, the Sigmoid, and what makes them stand out?

This section aims to provide a general overview of nearly all the current activation functions.

Tanh, A Simple Adjustment to Sigmoid. The hyperbolic tangent (tanh) function can be seen as a modified version of the sigmoid, offering a straightforward enhancement in terms of output range. By scaling and shifting the sigmoid, tanh achieves an output range of (-1, 1) with zero mean. This zero-centered output is advantageous as it aligns with our criteria for an effective activation function, ensuring that the input data and gradients are less biased toward any specific direction, whether positive or negative.
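If you want to see the "scaling and shifting" claim concretely, here is a tiny PyTorch check of the identity tanh(x) = 2 · sigmoid(2x) − 1, which is exactly where the (-1, 1) range and zero-centering come from:

```python
import torch

x = torch.linspace(-3, 3, 7)

# tanh is a rescaled, shifted sigmoid: the sigmoid's (0, 1) range is stretched
# to (-1, 1) and centered at zero.
rescaled_sigmoid = 2 * torch.sigmoid(2 * x) - 1
print(torch.allclose(torch.tanh(x), rescaled_sigmoid, atol=1e-6))  # True
```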

Despite these benefits, tanh retains the core S-shape of the sigmoid, which means it still compresses outputs into a narrow range. This compression leads to issues similar to those observed with sigmoid: gradients saturate, which hampers the network's ability to learn effectively during backpropagation.

ReLU, a popular choice in NNs. ReLU (Rectified Linear Unit) stands out for its simplicity, operating as a piecewise linear function where f(x) = max(0, x). This means it outputs zero for any negative input and mirrors the input otherwise. What makes ReLU particularly appealing is its straightforward design, which easily satisfies three of the four key properties we discussed above. Its linear nature on the positive side avoids compressing outputs into a tight range, unlike sigmoid or tanh, and its derivative is simple, being either 0 or 1.
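Here is a hand-written sketch of ReLU and its derivative, just to make the piecewise behavior explicit (in practice you would use torch.relu or nn.ReLU directly):

```python
import torch

def relu(x):
    # f(x) = max(0, x): positive inputs pass through unchanged, the rest become zero.
    return torch.clamp(x, min=0)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0 (the x = 0 case is discussed below).
    return (x > 0).float()

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # 0 for the negative inputs, x itself for the positive ones
print(relu_grad(x))  # tensor([0., 0., 0., 1., 1.])
```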

One intriguing aspect of ReLU is its ability to turn off neurons for negative inputs, which introduces sparsity into the model, similar to the effect of dropout regularization deactivating certain neurons. This can lead to more generalized models. However, it also leads to the "dying ReLU" issue, where neurons become inactive and stop learning because both their output and their gradient are zero. While some neurons may come back to life, those in early layers in particular can be permanently deactivated. This is similar to halting feedback in an ice cream production line, where the early stages fail to adapt based on customer feedback or to contribute useful intermediate products for subsequent stages.

Another point of consideration is ReLU's non-differentiability at x = 0, due to the sharp transition between its two linear segments. In practice, frameworks handle this using the concept of subgradients: any value in [0, 1] is a valid choice for the derivative at x = 0, and PyTorch, for example, simply uses 0. This typically doesn't pose an issue due to the rarity of exact zero inputs and the variability of data.
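A quick way to convince yourself is to ask autograd for the gradient exactly at zero; this little check assumes a recent PyTorch build:

```python
import torch

# Checking how autograd handles the kink at x = 0: any value in [0, 1] is a
# valid subgradient there, and PyTorch's ReLU backward picks 0.
x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x)
y.backward()
print(x.grad)  # tensor(0.)
```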

So, is ReLU the right choice for you? Many researchers say yes, thanks to its simplicity, efficiency, and support from major DNN frameworks. Moreover, recent studies, like one at https://arxiv.org/abs/2310.04564, highlight ReLU's ongoing relevance, marking a kind of renaissance in the ML world.

In certain applications, a variant known as ReLU6, which caps the output at 6, is used to prevent overly large activations. This modification, inspired by practical considerations, further illustrates the adaptability of ReLU in various neural network architectures. Why cap at 6? You can find the answer in this post.
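For reference, PyTorch ships this variant as nn.ReLU6, which is equivalent to clamping the output to the range [0, 6]; a minimal sketch:

```python
import torch
import torch.nn as nn

relu6 = nn.ReLU6()                     # ReLU with the positive side capped at 6
x = torch.tensor([-3.0, 2.0, 8.0])
print(relu6(x))                        # tensor([0., 2., 6.])
print(torch.clamp(x, min=0, max=6))    # same result, written out explicitly
```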

Leaky ReLUs, a slight twist on the classic ReLU. When we take a closer look at ReLU, a couple of issues emerge: its zero output for negative inputs leads to the "dying ReLU" problem, where neurons cease to update during training, and its preference for positive values can introduce a directional bias in the model. To counter these drawbacks while retaining ReLU's advantages, researchers developed several variations, including the concept of "leaky" ReLUs.

Leaky ReLU modifies the negative part of ReLU, giving it a small, non-zero slope. This adjustment allows negative inputs to produce small negative outputs, effectively "leaking" through the otherwise zero-output region. The slope of this leak is controlled by a hyperparameter α, which is typically set close to 0 to maintain a balance between sparsity and keeping neurons active. By allowing a slight negative output, Leaky ReLU aims to center the activation function's output around zero and prevent neurons from becoming inactive, thus addressing the "dying ReLU" issue.

However, introducing α as a hyperparameter adds a layer of complexity to model tuning. To manage this, variations of the original Leaky ReLU have been developed (a small sketch of all three follows the list below):

  • Randomized Leaky ReLU (RReLU): This version randomizes α within a specified range during training, fixing it during evaluation. The randomness can help in regularizing the model and preventing overfitting.
  • Parametric Leaky ReLU (PReLU): PReLU allows α to be learned during training, adapting the activation function to the specific needs of the dataset. Even though this can enhance model performance by tailoring α to the training data, it also risks overfitting.
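Here is a small PyTorch sketch of all three flavors side by side; the slope values shown are either PyTorch defaults or illustrative choices, not recommendations:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 1.0])

leaky = nn.LeakyReLU(negative_slope=0.01)   # fixed alpha, chosen by hand
rrelu = nn.RReLU(lower=1/8, upper=1/3)      # alpha sampled per element during training
prelu = nn.PReLU(init=0.25)                 # alpha is a learnable parameter

print(leaky(x))   # tensor([-0.0200, -0.0050,  1.0000])
print(rrelu(x))   # negative part varies between runs while in training mode
print(prelu(x))   # tensor([-0.5000, -0.1250,  1.0000], grad_fn=...)
```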

Exponential Linear Unit (ELU), an Improvement on Leaky ReLU by Enhancing Control Over Leakage. Both Leaky ReLUs and ELUs allow negative values, which helps push mean unit activations closer to zero and keeps neurons active. The challenge with Leaky ReLUs is their inability to regulate the extent of these negative values; theoretically, these values could extend to negative infinity, despite intentions to keep them small. ELU addresses this by incorporating a nonlinear exponential curve for non-positive inputs, effectively narrowing and controlling the negative output range so that it saturates at −α.
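A minimal sketch of ELU's behavior, assuming the common default α = 1, which makes the saturation point −α = −1 easy to see:

```python
import torch
import torch.nn as nn

alpha = 1.0
elu = nn.ELU(alpha=alpha)  # f(x) = x for x > 0, alpha * (exp(x) - 1) for x <= 0

x = torch.tensor([-100.0, -2.0, 0.0, 2.0])
print(elu(x))
# tensor([-1.0000, -0.8647,  0.0000,  2.0000])
# The negative side decays smoothly and saturates at -alpha (here -1),
# unlike Leaky ReLU, whose negative outputs grow without bound.
```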

Tags: Courage To Learn Ml Data Science Deep Dives Deep Learning Machine Learning
