optim

The Optim module in minima is a flexible and powerful toolbox for optimizing the parameters of your deep learning models.

source

Optimizer

 Optimizer (params)

Base class for all optimizers. Not meant to be instantiated directly.

This class represents the abstract concept of an optimizer, and contains methods that all concrete optimizer classes must implement. It is designed to handle the parameters of a machine learning model, providing functionality to perform a step of optimization and to zero out gradients.

| | Type | Details |
|---|---|---|
| params | Iterable | The parameters of the model to be optimized. |
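
As a mental model, the interface described above can be sketched roughly as follows. The attribute and method names (`params`, `grad`, `step`, `zero_grad`) are PyTorch-style assumptions and may differ from minima's actual source:

```python
class Optimizer:
    """Illustrative sketch of the base optimizer interface (not minima's source)."""

    def __init__(self, params):
        # Materialize the iterable so the parameters can be walked on every step.
        self.params = list(params)

    def step(self):
        # Concrete optimizers (SGD, AdaGrad, RMSProp, Adam) implement the update rule here.
        raise NotImplementedError

    def zero_grad(self):
        # Clear accumulated gradients before the next backward pass.
        for p in self.params:
            p.grad = None
```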

SGD Optimizer

This is a PyTorch-style implementation of the classic Stochastic Gradient Descent (SGD) optimizer.

The SGD update is

\[ \theta_{t} = \theta_{t-1} - \alpha \cdot g_{t} \]

where \(\alpha\) is the learning rate, \(g_{t}\) is the gradient at time step \(t\), and \(\theta_{t}\) represents the model parameters at time step \(t\).

The learning rate \(\alpha\) is a scalar hyperparameter that controls the size of the update at each iteration.

An optional momentum term can be added to the update rule:

\[ \begin{align*} v_{t} & \leftarrow \mu v_{t-1} + (1-\mu) \cdot g_t \\ \theta_{t} & \leftarrow \theta_{t-1} - \alpha \cdot v_t \end{align*} \]

where \(v_{t}\) is the momentum term at time step \(t\), and \(\mu\) is the momentum factor. The momentum term grows for dimensions whose gradients consistently point in the same direction and shrinks for dimensions whose gradients change direction, which damps oscillations and smooths the update trajectory.

A weight decay term can also be included, which adds a regularization effect:

\[ \theta_{t} = (1 - \alpha \cdot \lambda) \cdot \theta_{t-1} - \alpha \cdot g_t \]

where \(\lambda\) is the weight decay factor. This results in the model weights shrinking at each time step, which can prevent overfitting by keeping the model complexity in check.
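
A minimal NumPy sketch of a single SGD step may help make the formulas above concrete. The array-valued parameters and the function itself are illustrative assumptions, not minima's internals:

```python
import numpy as np

def sgd_step(theta, grad, velocity, lr=0.01, momentum=0.0, wd=0.0):
    """One SGD step following the update rules above (illustrative sketch)."""
    if wd:
        # L2 weight decay folds lambda * theta into the gradient.
        grad = grad + wd * theta
    # Exponential-moving-average momentum, matching the formula above.
    velocity = momentum * velocity + (1.0 - momentum) * grad
    theta = theta - lr * velocity
    return theta, velocity

# One step on a toy parameter vector.
theta, velocity = np.array([1.0, -2.0]), np.zeros(2)
theta, velocity = sgd_step(theta, np.array([0.5, 0.1]), velocity, lr=0.1, momentum=0.9)
```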


source

SGD

 SGD (params, lr=0.01, momentum=0.0, wd=0.0)

Implements stochastic gradient descent (optionally with momentum).

This is a basic optimizer that’s suitable for many machine learning models, and is often used as a baseline for comparing other optimizers’ performance.

| | Type | Default | Details |
|---|---|---|---|
| params | Iterable | | The parameters of the model to be optimized. |
| lr | float | 0.01 | The learning rate. |
| momentum | float | 0.0 | The momentum factor. |
| wd | float | 0.0 | The weight decay (L2 regularization). |
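
In a training loop, usage might look like the sketch below. The `minima.optim` import path and the `parameters()`, `backward()`, `zero_grad()`, and `step()` calls are assumptions modeled on the PyTorch-style API described here; `model`, `dataloader`, and `loss_fn` are placeholders for your own objects.

```python
from minima import optim  # hypothetical import path

# model, dataloader, and loss_fn are placeholders, not part of this example.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, wd=1e-4)

for x, y in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward pass
    loss.backward()                  # backward pass computes fresh gradients
    optimizer.step()                 # apply the SGD update
```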

AdaGrad Optimizer

Intuitive explanation:

Imagine you’re trying to navigate your way across a complex terrain - like a big mountain with lots of hills, valleys and flat areas.
Your goal is to find the lowest valley. This is much like the problem a neural network faces when it’s trying to find the optimal values for its weights - the lowest point in its loss function.

You start at a random point on this terrain, which is like initializing your model with random weights. Now, you need to figure out which direction to go in to get to the lowest point.
You can’t see the whole terrain at once, but you can look around your current location and see which way is downhill. This is like calculating the gradient of the loss function with respect to the weights.

In a basic gradient descent algorithm, you would just go in the direction of the steepest slope with a fixed step size. But this approach can lead to problems.
What if you’re on a steep slope and you take too big of a step? You might overshoot the valley you’re trying to get to. Or, what if you’re on a flat part of the
terrain and you take too small of a step? You might get stuck and not make much progress.

This is where AdaGrad comes in. AdaGrad is like a smart hiker that adjusts its step size based on the terrain it’s currently on.
If it’s on a steep slope, it takes smaller steps to avoid overshooting the valley. If it’s on a flat area, it takes bigger steps to make faster progress.

It does this by keeping track of the sum of the squares of the gradients it has seen so far (a kind of memory), and uses this sum to scale the step size. Parameters with larger accumulated gradients have their learning rate decreased more, while parameters with smaller accumulated gradients have their learning rate decreased less.

The neat thing about AdaGrad is that it adjusts the learning rate for each parameter individually, based on what it’s learned about the landscape around that parameter.
This can be especially useful when dealing with sparse data, where only a few parameters might be updated frequently.

Detailed explanation

Building on the foundational concepts of Stochastic Gradient Descent (SGD), AdaGrad introduces an innovative twist to the optimization process.
Unlike traditional SGD, which uses a single learning rate \(\alpha\) across all parameters, AdaGrad maintains a per-parameter learning rate. The AdaGrad update is computed as:

\[ \theta_{t} = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g_{t} \]

where \(\theta_{t}\) represents the model parameters at time step \(t\), \(\alpha\) is the initial learning rate, \(g_{t}\) is the gradient at time step \(t\), \(G_{t}\) is a diagonal matrix
where each diagonal element \(i, i\) is the sum of the squares of the gradients w.r.t. \(\theta_i\) up to time step \(t\), and \(\epsilon\) is a smoothing term to avoid division by zero (usually on the order of \(1e-7\)).

In AdaGrad, each parameter \(\theta_i\) gets its own learning rate, which is inversely proportional to the square root of the sum of the squares of past gradients.
This is the cache in the implementation, which holds a history of squared gradients. The greater the sum of the past gradients for a particular parameter, the smaller the learning rate for that parameter.
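
The per-parameter scaling can be sketched in a few lines of NumPy. This is an illustration of the formula above under the assumption of array-valued parameters, not minima's actual code:

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.001, eps=1e-7):
    """One AdaGrad step: accumulate squared gradients, then scale the update."""
    cache = cache + grad ** 2                         # running sum of squared gradients
    theta = theta - lr * grad / np.sqrt(cache + eps)  # per-parameter effective learning rate
    return theta, cache
```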

This feature allows AdaGrad to normalize the updates made during training, preventing the update to any single weight from growing too large compared to the others.
This is particularly beneficial when dealing with sparse data, as the less frequently updated parameters are allowed larger updates when they do get updated, thereby effectively utilizing more neurons for training.

However, it’s important to note that AdaGrad has a tendency to decrease the learning rate quite aggressively due to the constant accumulation of the square of gradients in \(G_{t}\).
This can sometimes lead to premature and excessive decay of the learning rate during training, causing the model to stop learning before reaching the optimal point.
This monotonic decrease in the learning rate is one reason AdaGrad is not as widely used, except in some specific applications.
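
To see how quickly this decay can bite, consider a parameter that receives a gradient of magnitude 1.0 at every step: the cache after \(t\) steps equals \(t\), so the effective step size shrinks like \(\alpha / \sqrt{t}\).

```python
# Effective AdaGrad step size when the gradient magnitude stays at 1.0:
# the cache after t steps is t, so the step shrinks like lr / sqrt(t).
lr = 0.001
for t in (1, 100, 10_000):
    print(t, lr / t ** 0.5)   # 0.001, 0.0001, 1e-05
```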

To summarize, AdaGrad adds a valuable tool to our optimization toolkit by providing an adaptive learning rate for each individual parameter.
It elegantly solves the problem of learning rate selection and normalization of parameter updates, and while it has some limitations, it’s a
powerful concept that has paved the way for further innovations in optimization algorithms.


source

AdaGrad

 AdaGrad (params, lr=0.001, wd=0.0, eps=1e-07)

Implements AdaGrad optimization algorithm.

AdaGrad is an optimizer with parameter-wise learning rates, which adapts the learning rate based on how frequently a parameter gets updated during training. It’s particularly useful for sparse data.

| | Type | Default | Details |
|---|---|---|---|
| params | Iterable | | The parameters of the model to be optimized. |
| lr | float | 0.001 | The initial learning rate. |
| wd | float | 0.0 | The weight decay (L2 regularization). |
| eps | float | 1e-07 | A small constant for numerical stability. |

RMSProp Optimizer

RMSProp, short for Root Mean Square Propagation, is an optimization algorithm that introduces an adaptive learning rate for each parameter in a model.

It tackles the varied landscape of the loss function by maintaining a moving (or ‘running’) average of the squared gradients, effectively measuring the scale of recent gradients. This running average, also known as the cache, is calculated as follows:

\[ cache_{t} = \rho \cdot cache_{t-1} + (1-\rho) \cdot (g_{t})^2 \]

where \(\rho\) is the decay rate that determines how much of the history of squared gradients we retain. This cache term holds a form of “memory” of the magnitude of recent gradients, and its contents “move” with the data over time.

Then, the parameter update rule becomes:

\[ \theta_{t} = \theta_{t-1} - \frac{\alpha}{\sqrt{cache_{t} + \epsilon}} \cdot g_{t} \]

where \(\epsilon\) is a small constant for numerical stability, often around \(1e-8\). This normalization by the square root of the cache ensures smooth changes in the learning rate and
helps retain the global direction of parameter updates. This adaptivity makes the learning rate changes more resilient to fluctuations in the gradient.

RMSProp introduces a new hyperparameter, \(\rho\), the decay rate of the cache. Because the cache adaptively rescales every update, even small gradients can produce sizable parameter updates, so the default learning rate used with RMSProp is typically smaller, around \(0.001\), to ensure stability.
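
Putting the cache update and the parameter update together, one RMSProp step can be sketched in NumPy as follows. This is an illustration of the formulas above, not minima's implementation:

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp step following the formulas above (illustrative sketch)."""
    cache = rho * cache + (1.0 - rho) * grad ** 2   # decayed average, not a running sum
    theta = theta - lr * grad / np.sqrt(cache + eps)
    return theta, cache
```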


source

RMSProp

 RMSProp (params, lr=0.001, wd=0.0, eps=1e-07, rho=0.9)

Implements RMSProp optimization algorithm.

RMSProp is an optimizer that adapts the learning rate for each parameter individually, making it suitable for dealing with sparse or multi-scale data.

| | Type | Default | Details |
|---|---|---|---|
| params | Iterable | | The parameters of the model to be optimized. |
| lr | float | 0.001 | The initial learning rate. |
| wd | float | 0.0 | The weight decay (L2 regularization). |
| eps | float | 1e-07 | A small constant for numerical stability. |
| rho | float | 0.9 | The decay rate for the moving average of squared gradients. |

Adam Optimizer

This is a PyTorch-like implementation of the popular Adam optimizer from the paper Adam: A Method for Stochastic Optimization.

The Adam update is

\[ \begin{align} m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t \\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\ \hat{m}_t &\leftarrow \frac{m_t}{1-\beta_1^t} \\ \hat{v}_t &\leftarrow \frac{v_t}{1-\beta_2^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align} \]

where \(\alpha\), \(\beta_1\), \(\beta_2\) and \(\epsilon\) are scalar hyperparameters, \(m_t\) and \(v_t\) are the first and second moment estimates, and \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected counterparts. \(\epsilon\) guards against division by zero, but also acts as a hyperparameter that damps the effect of gradient variance.
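
The equations translate almost line for line into a NumPy sketch of one Adam step. The function below is illustrative, assuming array-valued parameters and a step counter `t` that starts at 1; it is not minima's implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step following the equations above (illustrative sketch)."""
    m = beta1 * m + (1.0 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```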

Assuming \(\epsilon = 0\), the effective step taken is

\[ \Delta t = \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \]

This is bounded by

\[ \vert \Delta t \vert \le \alpha \cdot \frac{1 - \beta_1}{\sqrt{1-\beta_2}} \]

when \(1-\beta_1 \gt \sqrt{1-\beta_2}\), and by

\[ \vert \Delta t \vert \le \alpha \]

otherwise. In most common scenarios,

\[ \vert \Delta t \vert \approx \alpha \]
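
For example, with the default \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) we have \(1-\beta_1 = 0.1 > \sqrt{1-\beta_2} \approx 0.0316\), so the first bound applies:

\[ \vert \Delta t \vert \le \alpha \cdot \frac{0.1}{0.0316} \approx 3.16 \, \alpha \]

so even in this worst case the step stays within a small multiple of the learning rate.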


source

Adam

 Adam (params, lr=1e-05, beta1=0.9, beta2=0.999, eps=1e-08,
       weight_decay=0.0)

Implements the Adam optimization algorithm.

Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It leverages adaptive learning rate methods to find an individual learning rate for each parameter.

| | Type | Default | Details |
|---|---|---|---|
| params | Iterable | | The parameters of the model to be optimized. |
| lr | float | 1e-05 | The learning rate \(\alpha\). |
| beta1 | float | 0.9 | The exponential decay rate for the first moment estimates. |
| beta2 | float | 0.999 | The exponential decay rate for the second moment estimates. |
| eps | float | 1e-08 | A small constant \(\epsilon\) for numerical stability. |
| weight_decay | float | 0.0 | The weight decay (L2 regularization). |
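
Constructing the optimizer mirrors the SGD example earlier; the import path and `model.parameters()` are again assumptions based on the PyTorch-style API documented above.

```python
from minima import optim  # hypothetical import path

# model.parameters() is a placeholder for your model's parameter iterable.
optimizer = optim.Adam(model.parameters(), lr=1e-5, beta1=0.9, beta2=0.999,
                       eps=1e-8, weight_decay=0.0)
# The training loop is the same as in the SGD example: zero_grad(), backward(), step().
```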