A Complete Guide to Neural Network Optimizers
Training neural networks is fundamentally an optimization problem: we’re searching for the best set of weights that minimize our loss function. While the concept sounds straightforward, the path from random initialization to a well-trained model is rarely a smooth descent. The landscape of loss functions in high-dimensional spaces is filled with valleys, plateaus, and saddle points that can trap or slow down naive optimization approaches.
This is where optimization algorithms come in. Over the years, researchers have developed increasingly sophisticated methods to navigate these challenging landscapes more efficiently. Each optimizer builds upon the limitations of its predecessors, introducing new mechanisms to accelerate convergence, handle sparse gradients, or adapt to different learning scenarios.
In this guide, we’ll explore seven key optimization techniques: SGD, Momentum, Nesterov Momentum, AdaGrad, RMSProp, Adam, and AdamW. We’ll examine how each one works, what problems it solves, and when you might want to use it.
Quick Reference: Optimizer Comparison
| Optimizer | Key Feature | Addresses Limitation Of | Pros | Cons |
|---|---|---|---|---|
| SGD | Simple gradient descent | N/A | Easy to implement | Oscillation, fixed learning rate |
| Momentum | Gradient accumulation | SGD | Reduces oscillations | No anticipation of future trends |
| Nesterov | Lookahead gradients | Momentum | Better convergence | Slightly higher computation |
| AdaGrad | Adaptive learning rates | Nesterov | Handles sparse gradients | Learning rate decays too fast |
| RMSProp | Smoothed adaptive learning rates | AdaGrad | Stabilizes learning rates | Sensitive to hyperparameters |
| Adam | Momentum + RMSProp | RMSProp | Combines best features | May converge to suboptimal minima |
| AdamW | Decoupled weight decay | Adam | Better generalization | Requires tuning decay parameter |
1. Stochastic Gradient Descent (SGD)
How It Works: Updates weights by calculating gradients using a small batch of data.
\[w_t = w_{t-1} - \eta \nabla f(w_{t-1})\]
Pros:
- Simple and computationally efficient
- Works well with large datasets
Cons:
- Can oscillate or converge slowly, especially in narrow valleys or near saddle points
- Learning rate (η) is fixed, leading to potential overshooting or slow convergence
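To make the update rule concrete, here is a minimal NumPy sketch of SGD on a toy quadratic loss. The loss \(f(w) = \tfrac{1}{2}\lVert w \rVert^2\), its gradient, and the learning rate below are illustrative assumptions, not prescriptions from this guide.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * ||w||^2 (illustrative only)
    return w

w = np.array([2.0, -3.0])   # arbitrary initial weights
eta = 0.1                   # learning rate

for step in range(10):
    w = w - eta * grad(w)   # w_t = w_{t-1} - eta * grad f(w_{t-1})

print(w)                    # weights move toward the minimum at the origin
```

In practice the gradient would come from a mini-batch of training data rather than a closed-form expression, which is what makes the descent "stochastic".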
2. Momentum
How It Works: Accumulates gradients to build momentum in directions with consistent gradients.
\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1})\)
\(w_t = w_{t-1} + v_t\)
Pros:
- Speeds up convergence in shallow but consistent directions (e.g., valleys)
- Reduces oscillations compared to SGD
Cons:
- Still overshoots if the learning rate is too high
- Cannot predict future gradient directions
Improvement Over SGD: Addresses oscillation and slow convergence by incorporating past gradients.
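A sketch of the momentum update, reusing the same toy quadratic loss as before. The velocity `v` accumulates past gradients; `beta = 0.9` is a common default but still an assumption here.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2

w = np.array([2.0, -3.0])
v = np.zeros_like(w)        # velocity starts at zero
eta, beta = 0.1, 0.9

for step in range(10):
    v = beta * v - eta * grad(w)   # v_t = beta * v_{t-1} - eta * grad f(w_{t-1})
    w = w + v                      # w_t = w_{t-1} + v_t
```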
3. Nesterov Momentum
How It Works: Looks ahead by computing gradients at the projected position.
\(v_t = \beta v_{t-1} - \eta \nabla f(w_{t-1} + \beta v_{t-1})\)
\(w_t = w_{t-1} + v_t\)
Pros:
- More precise updates by considering where the momentum is leading
- Accelerates convergence further compared to vanilla momentum
Cons:
- Slightly more computationally expensive due to gradient computation at the lookahead point
Improvement Over Momentum: Anticipates future gradient directions, resulting in better convergence.
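The only change from vanilla momentum in the sketch below is where the gradient is evaluated: at the projected position `w + beta * v` rather than at `w`. Same toy loss and hyperparameters as before, chosen purely for illustration.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2

w = np.array([2.0, -3.0])
v = np.zeros_like(w)
eta, beta = 0.1, 0.9

for step in range(10):
    lookahead = w + beta * v              # where momentum is about to take us
    v = beta * v - eta * grad(lookahead)  # gradient at the lookahead point
    w = w + v
```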
4. AdaGrad
How It Works: Adjusts the learning rate for each parameter based on the magnitude of past gradients.
\(g_t = \nabla f(w_{t-1})\)
\(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t, \quad G_t = \sum_{i=1}^t g_i^2\)
Pros:
- Works well for sparse gradients (e.g., NLP tasks)
- Automatically adapts learning rates for each parameter
Cons:
- Learning rate diminishes too quickly due to cumulative gradient sum, leading to potential underfitting
Improvement Over Nesterov Momentum: Introduces adaptive learning rates to handle sparse gradients.
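A sketch of AdaGrad's per-parameter scaling on the same toy loss. Note that `G` only grows, which is exactly why the effective learning rate keeps shrinking.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2

w = np.array([2.0, -3.0])
G = np.zeros_like(w)        # running sum of squared gradients, per parameter
eta, eps = 0.01, 1e-8

for step in range(10):
    g = grad(w)
    G = G + g ** 2                        # G_t = sum of g_i^2 up to step t
    w = w - eta / np.sqrt(G + eps) * g    # per-parameter scaled step
```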
5. RMSProp
How It Works: Modifies AdaGrad by using an exponentially weighted moving average of past squared gradients instead of a cumulative sum.
\(v_t = \beta v_{t-1} + (1 - \beta)(\nabla f(w_{t-1}))^2\)
\(w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla f(w_{t-1})\)
Pros:
- Prevents the learning rate from diminishing too quickly
- Suitable for non-stationary objectives
Cons:
- Sensitive to hyperparameter choices (e.g., β)
Improvement Over AdaGrad: Stabilizes learning rates by introducing an exponentially weighted average of squared gradients.
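The sketch below swaps AdaGrad's cumulative sum for an exponential moving average, so the denominator can shrink as well as grow. Hyperparameter values are illustrative.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2

w = np.array([2.0, -3.0])
v = np.zeros_like(w)        # moving average of squared gradients
eta, beta, eps = 0.01, 0.9, 1e-8

for step in range(10):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2    # exponentially weighted average of g^2
    w = w - eta / np.sqrt(v + eps) * g
```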
6. Adam (Adaptive Moment Estimation)
How It Works: Combines Momentum (first moment) and RMSProp (second moment).
- Update rules: \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(w_{t-1})\), \(v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla f(w_{t-1}))^2\)
- Bias corrections: \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)
- Update step: \(w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\)
Pros:
- Combines the benefits of Momentum and RMSProp
- Automatically adjusts learning rates for each parameter
- Bias correction ensures stability in early training
Cons:
- May converge to suboptimal solutions in some scenarios (e.g., small datasets or high regularization)
- Hyperparameter tuning can be challenging
Improvement Over RMSProp: Adds momentum and bias correction to handle noisy gradients and early instability.
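Putting the three pieces together (first moment, second moment, bias correction) on the same toy loss. The step counter `t` must start at 1 so the correction factors `1 - beta1**t` and `1 - beta2**t` are nonzero; defaults below are the commonly cited values but remain illustrative.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2

w = np.array([2.0, -3.0])
m = np.zeros_like(w)        # first moment (mean of gradients)
v = np.zeros_like(w)        # second moment (uncentered variance of gradients)
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 11):                     # t starts at 1 for bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
```

The bias corrections matter mostly in the first few iterations, when `m` and `v` are still close to their zero initialization.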
7. AdamW
How It Works: Decouples weight decay from the gradient update to improve generalization.
\[w_t = w_{t-1} - \eta \bigg( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1} \bigg)\]
Pros:
- Better generalization compared to Adam
- Retains benefits of adaptive learning rates
Cons:
- Still requires careful hyperparameter tuning
Improvement Over Adam: Decouples weight decay from gradient updates, improving generalization performance.
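The sketch below differs from the Adam loop in a single line: the decay term `lam * w` is added directly to the update instead of being folded into the gradient as an L2 penalty. The value `lam = 1e-2` is just an illustrative choice.

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy loss 0.5 * ||w||^2 (no L2 term included)

w = np.array([2.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
lam = 1e-2                                 # decoupled weight decay coefficient

for t in range(1, 11):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)   # decay applied directly to w
```

Because the decay bypasses the adaptive scaling, every weight is shrunk at the same relative rate regardless of its gradient history, which is the source of the improved generalization.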
Detailed Technical Comparison
| Method | Working Mechanism | Pros | Cons | Improvement Over Prior Method |
|---|---|---|---|---|
| SGD | Updates weights using gradients calculated on mini-batches. $w_t = w_{t-1} - \eta\nabla f(w_{t-1})$ | Simple, computationally efficient | Oscillates, slow convergence, fixed learning rate | - |
| Momentum | Accumulates gradients to build momentum for smoother updates. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1})$, $w_t = w_{t-1} + v_t$ | Speeds up convergence, reduces oscillations | May overshoot, lacks anticipation of future gradients | Reduces oscillations and improves convergence speed |
| Nesterov | Looks ahead to compute gradients at a projected future position. $v_t = \beta v_{t-1} - \eta\nabla f(w_{t-1} + \beta v_{t-1})$, $w_t = w_{t-1} + v_t$ | More precise updates, faster convergence | Slightly more computationally expensive | Anticipates future gradient directions |
| AdaGrad | Adjusts learning rates based on accumulated squared gradients. $w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}}g_t$, $G_t = \sum g_i^2$ | Adapts learning rates, good for sparse gradients | Learning rate diminishes too quickly, potential underfitting | Introduces adaptive learning rates for sparse features |
| RMSProp | Uses exponentially weighted moving averages of squared gradients. $v_t = \beta v_{t-1} + (1-\beta)g_t^2$, $w_t = w_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}}g_t$ | Prevents learning rate decay, handles non-stationary objectives | Sensitive to hyperparameters (e.g., β) | Stabilizes learning rates using moving averages |
| Adam | Combines Momentum (1st moment) and RMSProp (2nd moment) with bias correction. | Fast convergence, handles noisy gradients | May converge to suboptimal minima in some cases | Combines momentum and adaptive learning rates |
| AdamW | Decouples weight decay from gradient updates. $w_t = w_{t-1} - \eta[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1}]$ | Better generalization, retains Adam’s benefits | Requires tuning of decay parameter | Improves generalization by decoupling weight decay |
Hyperparameter Reference
| Method | Hyperparameter | Meaning | Typical Values | Tuning Suggestions |
|---|---|---|---|---|
| SGD | Learning rate ($\eta$) | Step size for updating weights | 0.01 to 0.1 | Start with a smaller value and adjust based on convergence |
| Momentum | Momentum coefficient ($\beta$) | Controls the contribution of past gradients to the current update | 0.9 | Keep fixed at 0.9 or tune slightly |
| Nesterov | Momentum coefficient ($\beta$) | Same as Momentum, with anticipation of future gradients | 0.9 | Same as Momentum |
| AdaGrad | Learning rate ($\eta$) | Base learning rate scaled by the inverse square root of accumulated squared gradients | 0.01 | Lower than SGD learning rates to avoid overshooting |
| RMSProp | Learning rate ($\eta$) | Similar to AdaGrad, with smoothing via an exponential moving average | 0.001 to 0.01 | Tune for stability based on loss |
| RMSProp | Decay rate ($\beta$) | Smoothing parameter for the moving average of squared gradients | 0.9 | Commonly fixed at 0.9 |
| Adam | Learning rate ($\eta$) | Base learning rate for parameter updates | 0.001 | Often works well without much tuning |
| Adam | $\beta_1$ | Decay rate for the first moment (mean of gradients) | 0.9 | Usually fixed |
| Adam | $\beta_2$ | Decay rate for the second moment (variance of gradients) | 0.999 | Keep fixed or tune slightly for sensitivity |
| Adam | $\epsilon$ | Small value to avoid division by zero | $10^{-7}$ or smaller | Rarely changed |
| AdamW | Learning rate ($\eta$) | Same as Adam | 0.001 | Same as Adam |
| AdamW | $\beta_1$, $\beta_2$, $\epsilon$ | Same as Adam | 0.9, 0.999, $10^{-7}$ | Same as Adam |
| AdamW | Weight decay ($\lambda$) | Regularization parameter to control overfitting by penalizing large weights | $10^{-4}$ to $10^{-2}$ | Start small and increase if overfitting is observed |
Conclusion
Choosing the right optimizer can dramatically impact your model’s training efficiency and final performance. While there’s no universal “best” optimizer, understanding the strengths and weaknesses of each approach helps you make informed decisions for your specific use case.
For most modern deep learning applications, Adam and AdamW have emerged as go-to choices due to their robust performance across diverse tasks with minimal hyperparameter tuning. Adam’s combination of momentum and adaptive learning rates makes it particularly effective for handling noisy gradients and training deep networks, while AdamW’s improved weight decay mechanism often leads to better generalization.
However, don’t overlook the classics. SGD with Momentum remains highly competitive, especially for computer vision tasks, and often achieves better final test accuracy when combined with proper learning rate scheduling. For problems with sparse gradients, such as natural language processing with large vocabularies, AdaGrad or RMSProp might be more appropriate.
The key takeaway is that optimizer selection should be guided by your problem’s characteristics: dataset size, gradient sparsity, computational budget, and generalization requirements. Start with a well-established baseline (Adam is usually a safe bet), monitor your training dynamics, and don’t hesitate to experiment with alternatives if you’re not seeing the convergence behavior you expect.
As the field continues to evolve, new optimizers and variants will undoubtedly emerge. But the fundamental principles underlying these seven methods (managing learning rates, leveraging momentum, and adapting to gradient statistics) will remain central to training neural networks effectively.