The init module in the minima (mi) library provides a suite of functions for creating and initializing tensors. Each function in this module implements a different initialization strategy, such as uniform or normal random values, constant values, or specialized schemes like Xavier and Kaiming initialization.
rand: This function generates a tensor filled with random numbers drawn from a uniform distribution between low and high (defaulting to 0 and 1). It does this by creating an array of random values on the specified device (defaulting to CPU), then scales and shifts these values to the correct range. The result is wrapped in a mi.Tensor object, which supports automatic differentiation if requires_grad is True.
randn: Similar to rand, but generates numbers from a normal distribution with the specified mean and standard deviation (defaulting to 0 and 1). This is done by creating an array of normally-distributed random values, then scaling and shifting them to match the requested parameters.
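As a rough sketch of how these might be called (`mi.init.randn` appears in the usage example at the end of this page; the keyword names `low`, `high`, `mean`, and `std` are assumptions taken from the descriptions above rather than confirmed signatures):

```python
import minima as mi

# Uniform values in [0, 5); keyword names low/high are assumed from the description above
x = mi.init.rand(3, 4, low=0.0, high=5.0)

# Normally distributed values; keyword names mean/std are likewise assumed
z = mi.init.randn(3, 4, mean=0.0, std=1.0, requires_grad=True)
```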
constant: This function creates a tensor filled with a constant value c (defaulting to 1). It does this by creating an array of ones on the specified device and then scaling those ones by c. The result is a tensor of shape shape, filled with the constant value c.
ones and zeros: These functions are simply shortcuts for creating tensors filled with ones or zeros, respectively. They’re implemented by calling the constant function with c set to 1 or 0.
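A hedged sketch of the constant-style helpers, assuming the keyword name `c` from the description above and the same dimension-style arguments as `randn`:

```python
import minima as mi

# A 2x3 tensor filled with 7.0 (keyword name `c` assumed from the description above)
sevens = mi.init.constant(2, 3, c=7.0)

# Shortcuts equivalent to constant(..., c=1) and constant(..., c=0)
o = mi.init.ones(2, 3)
z = mi.init.zeros(2, 3)
```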
randb: This function creates a binary tensor, with each element independently being True with probability p (defaulting to 0.5). This is done by generating uniformly-distributed random numbers and checking whether they’re less than or equal to p.
Generates a binary tensor with random values of True or False.
|  | Type | Default | Details |
|---|---|---|---|
| shape |  |  |  |
| p | float | 0.5 |  |
| device | NoneType | None |  |
| dtype | str | bool |  |
| requires_grad | bool | False |  |
| Returns | mi.Tensor |  | A binary tensor of shape `shape`, filled with random boolean values, where the probability of True is `p`. |
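For example, randb can build a random boolean mask; the call below assumes the parameter names from the table and that the shape is passed as separate dimensions, the same way as for `randn`:

```python
import minima as mi

# Each element is independently True with probability 0.8
keep_mask = mi.init.randb(3, 4, p=0.8)
```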
one_hot: This function creates a one-hot encoding tensor. Given a size n and an index i, it creates a tensor of size n with a 1 at the i-th position and 0s elsewhere.
one_hot (n, i, device=None, dtype='float32', requires_grad=False)
Generates a one-hot encoding tensor.
|  | Type | Default | Details |
|---|---|---|---|
| n | int |  | The size of the one-hot vector. |
| i | int |  | The index to be set to 1 in the one-hot vector. |
| device | NoneType | None | The device where the tensor will be allocated. Default is CPU. |
| dtype | str | float32 | The data type of the tensor. Default is 'float32'. |
| requires_grad | bool | False | If True, the tensor is created with gradient tracking. Default is False. |
| Returns | mi.Tensor |  | A one-hot tensor of size `n`, with the `i`-th element set to 1 and all others set to 0. |
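Using the signature above, encoding class index 3 out of 10 classes looks like this (the `mi.init` namespace is assumed, matching the usage example at the end of this page):

```python
import minima as mi

# A length-10 vector with a 1 at index 3 and 0s elsewhere
v = mi.init.one_hot(10, 3)
```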
Glorot/Xavier Initialization
Xavier initialization, also known as Glorot initialization, is a technique for initializing the weights of artificial neural networks to improve the stability and speed of training. In the paper “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio identified a value for the variance of the weights that works well to mitigate the problems we’ve discussed.
Here’s a high-level idea of how it works:
Neural networks are trained using a method called backpropagation, which involves iteratively adjusting the weights of the network based on the difference between the network’s current output and its desired output.
One challenge with this process is that the scale of the initial weights can have a large impact on the network’s learning dynamics. If the weights are too large or too small, the network might learn very slowly, or not at all. This is particularly an issue in deep networks where there are many layers of weights to learn.
Xavier initialization seeks to address this issue by scaling the initial weights in proportion to the number of inputs and outputs of the neuron. Specifically, in Xavier initialization, the weights are drawn from a distribution with a mean of 0 and a variance defined as:
\[
\text{var}(w)=\frac{2}{n_{in}+n_{out}}
\]
where \(n_{in}\) is the number of inputs to the neuron and \(n_{out}\) is the number of outputs. To give the weights a standard deviation of \(\sqrt{\frac{2}{n_{in}+n_{out}}}\), and hence a variance of \(\frac{2}{n_{in}+n_{out}}\), the weights are first generated randomly from a normal distribution with a mean of 0 and a standard deviation of 1.
Each weight is then multiplied by \(\sqrt{\frac{2}{n_{in}+n_{out}}}\), which scales the standard deviation of the distribution to the desired value.
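The scaling rule is easy to verify numerically; the following is only an illustrative NumPy sketch, not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256

# Draw from a standard normal, then scale by sqrt(2 / (fan_in + fan_out))
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

print(W.var())                   # close to 2 / (fan_in + fan_out) ≈ 0.0019
print(2.0 / (fan_in + fan_out))  # the target variance
```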
Initializes a tensor using Xavier (Glorot) Normal initialization.
This initializer is designed to keep the scale of the gradients roughly the same in all layers. It samples weights from a normal distribution centered around 0 with standard deviation `gain * sqrt(2 / (fan_in + fan_out))`.
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the standard deviation of the normal distribution. Default is 1.0. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor initialized using Xavier Normal initialization. |
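A possible call for a 784-to-256 fully connected layer; the function name `xavier_normal` and the `mi.init` namespace are assumptions inferred by analogy with the `xavier_uniform` example further below:

```python
import minima as mi

# Function name and namespace assumed; gain=1.0 is the documented default
W = mi.init.xavier_normal(fan_in=784, fan_out=256, gain=1.0)
```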
It’s worth noting that there is also a Xavier initialization variant based on a uniform rather than a normal distribution. The resulting weight matrix contains values sampled from a uniform distribution within the range \((-a, a)\), with \(a\) equal to \(\sqrt{\frac{6}{n_{in}+n_{out}}}\).
Initializes a tensor using Xavier (Glorot) Uniform initialization.
This initializer is designed to keep the scale of the gradients roughly the same in all layers. It samples weights from a uniform distribution within the range `[-gain * sqrt(6 / (fan_in + fan_out)), gain * sqrt(6 / (fan_in + fan_out))]`.
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the range of the uniform distribution. Default is 1.0. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor initialized using Xavier Uniform initialization. |
Both the normal and uniform variants have proven effective in practice, and it is up to the network designer to choose between them. Xavier initialization is frequently used to promote more stable training and to avoid problems that stem from unstable gradients, such as vanishing and exploding gradients.
```python
# Initialize weights with Xavier/Glorot initialization
W = xavier_uniform(fan_in=10, fan_out=5)
```
The original Xavier initialization was derived for activation functions that are roughly linear and symmetric around zero, such as tanh (the paper also considered the logistic sigmoid). If you’re using a different activation function, like ReLU, you might need a different initialization scheme, such as He initialization, which is a modification of Xavier initialization designed for ReLU and other non-symmetric activation functions.
He Initialization
Kaiming Initialization, also known as He Initialization, is a method used in initializing the weights of Neural Networks. This initialization method is designed specifically for neural networks with Rectified Linear Unit (ReLU) activation functions. It was proposed by Kaiming He et al. in their 2015 paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.
Principles of Kaiming Initialization:
The basic idea of Kaiming initialization is to keep the variance of each layer’s inputs and outputs as consistent as possible during forward and backward propagation. This counteracts the vanishing or exploding gradients that arise as networks grow deeper, and helps the model learn effectively.
Kaiming initialization initializes a weight matrix \(w\) with random values sampled from a normal distribution with a mean of \(0\) and variance
\[\text{var}(w)=\frac{2}{n_{i}}\]
Here, \(n_{i}\) is the number of inputs to the neuron and \(w\) is the weight vector.
Just as with Xavier initialization, to force the weight distribution to take on this variance, the weights are first randomly generated from a normal distribution centered around 0 with a standard deviation of 1. Then, each weight is multiplied by
\[\sqrt{\frac{2}{n_{i}}}\]
where \(n_{i}\) is the number of inputs coming into the neuron (also known as the “fan-in”).
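The effect of the factor of 2 can be seen in a small NumPy sketch (again, not the library code): with Kaiming scaling, the mean square of a ReLU layer's outputs stays close to that of its unit-variance inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, n = 512, 512, 10_000

x = rng.standard_normal((n, fan_in))                                # unit-variance inputs
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)  # Kaiming scaling
h = np.maximum(x @ W, 0.0)                                          # ReLU activations

print(np.mean(x ** 2))  # ≈ 1.0
print(np.mean(h ** 2))  # ≈ 1.0 as well: the signal scale survives the ReLU layer
```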
Fills the input Tensor with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a normal distribution. The resulting tensor will have values sampled from normal distribution with mean=0 and std=sqrt(2 / fan_in).
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (nn.functional name); recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor of shape (fan_in, fan_out), filled with random numbers from the normal distribution according to the Kaiming initialization. |
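A possible call, with the function name `kaiming_normal` and the `mi.init` namespace assumed by analogy with the earlier examples:

```python
import minima as mi

# Weights for a 784 -> 256 layer feeding a ReLU (function name and namespace assumed)
W = mi.init.kaiming_normal(fan_in=784, fan_out=256, nonlinearity='relu')
```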
There is also a version of Kaiming initialization for uniform rather than normal distributions. The resulting weight matrix will have values sampled from a uniform distribution within the range \((-a, a)\), where, as described in the docstring below, \(a = \sqrt{\frac{2}{n_{i}}}\), the same value used as the standard deviation in the normal variant.
Fills the input Tensor with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a uniform distribution. The resulting tensor will have values sampled from uniform distribution in the range [-std, std] where std = sqrt(2 / fan_in).
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (nn.functional name); recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor of shape (fan_in, fan_out), filled with random numbers from the uniform distribution according to the Kaiming initialization. |
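And the uniform counterpart, again with the name and namespace assumed by analogy:

```python
import minima as mi

# Function name and namespace assumed, mirroring kaiming_normal above
W = mi.init.kaiming_uniform(fan_in=784, fan_out=256, nonlinearity='relu')
```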
Advantages of Kaiming Initialization:
It helps to keep the variance of the gradients roughly the same across all layers. This ensures that all layers in the network learn at about the same speed, avoiding the saturation of activation functions, and it can also help speed up the convergence of the network.
It performs better with ReLU and its variants because it accounts for the fact that the variance of the output of a neuron with a ReLU activation function is half the variance of its input.
```python
def append_stats(hook, mod, inp, outp):
    # Record the running mean/std of a module's activations on every forward pass
    if not hasattr(hook, 'stats'): hook.stats = ([], [])
    acts = outp  # TODO: move outp to cpu when using an accelerator
    hook.stats[0].append(acts.numpy().mean())
    hook.stats[1].append(acts.numpy().std())

#| export
def _lsuv_stats(hook, mod, inp, outp):
    # Store only the latest activation statistics, used by the LSUV loop below
    acts = outp
    hook.mean = acts.numpy().mean()
    hook.std = acts.numpy().std()
```
```python
class LSUV:
    def __init__(self, model, batch) -> None:
        self.model = model
        self.batch = batch
        # Layers with trainable weights (BatchNorm layers are skipped) and their ReLU activations
        self.params_layers = [m for m in model if hasattr(m, 'weight') and not isinstance(m, mi.nn.BatchNorm1d)]
        self.act_fns = [m for m in model if isinstance(m, mi.nn.ReLU)]
        # Constants
        self.TOLERANCE = 1e-3

    def lsuv_init(self):
        """
        Layer-wise Sequential Unit Variance Initialization (LSUV).
        A method to help neural nets converge faster.

        Args:
            model : the model on which to perform LSUV initialization
            param_module : the module with trainable parameters to which the Hook is to be registered
            activation_module : the activation module to be initialized (ReLU, Sigmoid, etc.)
            input_data : input data to be passed through the model
        """
        for params_layer, acts_layer in zip(self.params_layers, self.act_fns):
            hook = Hook(acts_layer, _lsuv_stats)
            # Re-run the batch and rescale until the activations have ~zero mean and ~unit variance
            while self.model(self.batch) is not None and (
                abs(hook.std - 1) > self.TOLERANCE or abs(hook.mean) > self.TOLERANCE
            ):
                print(f'---> before: {hook.mean} -- {hook.std}')
                if params_layer.bias is not None:
                    params_layer.bias -= mi.Tensor(hook.mean)
                params_layer.weight.data /= mi.Tensor(hook.std)
            print(f'-------------> after: {hook.mean} -- {hook.std}')
            hook.remove()
```
```python
import numpy as np
import minima as mi

# Number of samples
n_samples = 1000
# Number of features (28x28 pixels for a grayscale image)
n_features = 784
# Number of classes
n_classes = 10

# Generate random inputs from a standard normal distribution
X = mi.init.randn(n_samples, n_features)
# Generate random target classes
y = mi.Tensor(np.random.randint(0, n_classes, size=n_samples))
```
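To tie the pieces together, a hypothetical LSUV run on this data could look like the sketch below; mi.nn.ReLU and mi.nn.BatchNorm1d are referenced by the LSUV class above, while mi.nn.Sequential and mi.nn.Linear are assumed names used purely for illustration.

```python
# Hypothetical model definition: Sequential and Linear are assumed module names
model = mi.nn.Sequential(
    mi.nn.Linear(n_features, 256), mi.nn.ReLU(),
    mi.nn.Linear(256, n_classes),
)

# Rescale each weight layer until its ReLU outputs have roughly zero mean and unit variance
LSUV(model, X).lsuv_init()
```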