The init module in the minima (mi) library provides a suite of functions for creating and initializing tensors. Each function in this module implements a different initialization strategy, such as uniform or normal random values, constant values, or specialized schemes like Xavier and Kaiming initialization.
rand: This function generates a tensor filled with random numbers drawn from a uniform distribution between low and high (defaulting to 0 and 1). It does this by creating an array of random values on the specified device (defaulting to CPU), then scales and shifts these values to the correct range. The result is wrapped in a mi.Tensor object, which supports automatic differentiation if requires_grad is True.
randn: Similar to rand, but generates numbers from a normal distribution with the specified mean and standard deviation (defaulting to 0 and 1). This is done by creating an array of normally-distributed random values, then scaling and shifting them to match the requested parameters.
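As a rough sketch of how these might be called (`mi.init.randn` appears in the usage example at the end of this page; the keyword names `low`, `high`, `mean`, and `std` are assumptions taken from the descriptions above rather than confirmed signatures):

```python
import minima as mi

# Uniform values in [0, 5); keyword names low/high are assumed from the description above
x = mi.init.rand(3, 4, low=0.0, high=5.0)

# Normally distributed values; keyword names mean/std are likewise assumed
z = mi.init.randn(3, 4, mean=0.0, std=1.0, requires_grad=True)
```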
constant: This function creates a tensor filled with a constant value c (defaulting to 1). It does this by creating an array of ones on the specified device and then scaling those ones by c. The result is a tensor of shape shape, filled with the constant value c.
ones and zeros: These functions are simply shortcuts for creating tensors filled with ones or zeros, respectively. They’re implemented by calling the constant function with c set to 1 or 0.
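A hedged sketch of the constant-style helpers, assuming the keyword name `c` from the description above and the same dimension-style arguments as `randn`:

```python
import minima as mi

# A 2x3 tensor filled with 7.0 (keyword name `c` assumed from the description above)
sevens = mi.init.constant(2, 3, c=7.0)

# Shortcuts equivalent to constant(..., c=1) and constant(..., c=0)
o = mi.init.ones(2, 3)
z = mi.init.zeros(2, 3)
```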
randb: This function creates a binary tensor, with each element independently being True with probability p (defaulting to 0.5). This is done by generating uniformly-distributed random numbers and checking whether they’re less than or equal to p.
Generates a binary tensor with random values of True or False.
|  | Type | Default | Details |
|---|---|---|---|
| shape |  |  |  |
| p | float | 0.5 |  |
| device | NoneType | None |  |
| dtype | str | bool |  |
| requires_grad | bool | False |  |
| Returns | mi.Tensor |  | A binary tensor of shape `shape`, filled with random boolean values, where the probability of True is `p`. |
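For example, randb can build a random boolean mask; the call below assumes the parameter names from the table and that the shape is passed as separate dimensions, the same way as for `randn`:

```python
import minima as mi

# Each element is independently True with probability 0.8
keep_mask = mi.init.randb(3, 4, p=0.8)
```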
one_hot: This function creates a one-hot encoding tensor. Given a size n and an index i, it creates a tensor of size n with a 1 at the i-th position and 0s elsewhere.
one_hot (n, i, device=None, dtype='float32', requires_grad=False)
Generates a one-hot encoding tensor.
|  | Type | Default | Details |
|---|---|---|---|
| n | int |  | The size of the one-hot vector. |
| i | int |  | The index to be set to 1 in the one-hot vector. |
| device | NoneType | None | The device where the tensor will be allocated. Default is CPU. |
| dtype | str | float32 | The data type of the tensor. Default is 'float32'. |
| requires_grad | bool | False | If True, the tensor is created with gradient tracking. Default is False. |
| Returns | mi.Tensor |  | A one-hot tensor of size `n`, with the `i`-th element set to 1 and all others set to 0. |
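Using the signature above, encoding class index 3 out of 10 classes looks like this (the `mi.init` namespace is assumed, matching the usage example at the end of this page):

```python
import minima as mi

# A length-10 vector with a 1 at index 3 and 0s elsewhere
v = mi.init.one_hot(10, 3)
```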
Glorot/Xavier Initialization
Xavier initialization, also known as Glorot initialization, is a technique for initializing the weights of artificial neural networks to improve the stability and speed of training. In the paper “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio identified a value for the variance of the weights that works well to mitigate the problems we’ve discussed.
Here’s a high-level idea of how it works:
Neural networks are trained using a method called backpropagation, which involves iteratively adjusting the weights of the network based on the difference between the network’s current output and its desired output.
One challenge with this process is that the scale of the initial weights can have a large impact on the network’s learning dynamics. If the weights are too large or too small, the network might learn very slowly, or not at all. This is particularly an issue in deep networks where there are many layers of weights to learn.
Xavier initialization seeks to address this issue by scaling the initial weights in proportion to the number of inputs and outputs of the neuron. Specifically, in Xavier initialization, the weights are drawn from a distribution with a mean of 0 and a variance defined as:
\[
\text{var}(w)=\frac{2}{n_{in}+n_{out}}
\]
where \(n_{in}\) is the number of inputs to the neuron and \(n_{out}\) is the number of outputs. To give the weights a standard deviation of \(\sqrt{\frac{2}{n_{in}+n_{out}}}\), and hence a variance of \(\frac{2}{n_{in}+n_{out}}\), the weights are first generated randomly from a normal distribution with a mean of 0 and a standard deviation of 1.
Each weight is then multiplied by \(\sqrt{\frac{2}{n_{in}+n_{out}}}\), which scales the standard deviation of the distribution to the desired value.
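The scaling rule is easy to verify numerically; the following is only an illustrative NumPy sketch, not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256

# Draw from a standard normal, then scale by sqrt(2 / (fan_in + fan_out))
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

print(W.var())                   # close to 2 / (fan_in + fan_out) ≈ 0.0019
print(2.0 / (fan_in + fan_out))  # the target variance
```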
Initializes a tensor using Xavier (Glorot) Normal initialization.
This initializer is designed to keep the scale of the gradients roughly the same in all layers. It samples weights from a normal distribution centered around 0 with standard deviation `gain * sqrt(2 / (fan_in + fan_out))`.
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the standard deviation of the normal distribution. Default is 1.0. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor initialized using Xavier Normal initialization. |
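A possible call for a 784-to-256 fully connected layer; the function name `xavier_normal` and the `mi.init` namespace are assumptions inferred by analogy with the `xavier_uniform` example further below:

```python
import minima as mi

# Function name and namespace assumed; gain=1.0 is the documented default
W = mi.init.xavier_normal(fan_in=784, fan_out=256, gain=1.0)
```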
It’s worth noting that there is also a Xavier initialization variant based on a uniform rather than a normal distribution. The resulting weight matrix contains values sampled from a uniform distribution within the range \((-a, a)\), with \(a\) equal to \(\sqrt{\frac{6}{n_{in}+n_{out}}}\).
Initializes a tensor using Xavier (Glorot) Uniform initialization.
This initializer is designed to keep the scale of the gradients roughly the same in all layers. It samples weights from a uniform distribution within the range `[-gain * sqrt(6 / (fan_in + fan_out)), gain * sqrt(6 / (fan_in + fan_out))]`.
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the range of the uniform distribution. Default is 1.0. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor initialized using Xavier Uniform initialization. |
Both the normal and uniform variants have proven effective in practice, and it is up to the network designer to choose between them. Xavier initialization is frequently used to promote more stable training and to avoid problems that stem from unstable gradients, such as vanishing and exploding gradients.
```python
# Initialize weights with Xavier/Glorot initialization
W = xavier_uniform(fan_in=10, fan_out=5)
```
The original Xavier initialization was derived for activation functions that are roughly linear and symmetric around zero, such as tanh (the paper also considered the logistic sigmoid). If you’re using a different activation function, like ReLU, you might need a different initialization scheme, such as He initialization, which is a modification of Xavier initialization designed for ReLU and other non-symmetric activation functions.
He Initialization
Kaiming Initialization, also known as He Initialization, is a method used in initializing the weights of Neural Networks. This initialization method is designed specifically for neural networks with Rectified Linear Unit (ReLU) activation functions. It was proposed by Kaiming He et al. in their 2015 paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.
Principles of Kaiming Initialization:
The basic idea of Kaiming initialization is to keep the variance of each layer’s inputs and outputs as consistent as possible during forward and backward propagation. This counteracts the vanishing or exploding gradients that arise as networks grow deeper, and helps the model learn effectively.
Kaiming initialization initializes a weight matrix \(w\) with random values sampled from a normal distribution with a mean of \(0\) and variance
\[\text{var}(w)=\frac{2}{n_{i}}\]
Here, \(n_{i}\) is the number of inputs to the neuron and \(w\) is the weight vector.
Just as with Xavier initialization, to force the weight distribution to take on this variance, the weights are first randomly generated from a normal distribution centered around 0 with a standard deviation of 1. Then, each weight is multiplied by
\[\sqrt{\frac{2}{n_{i}}}\]
where \(n_{i}\) is the number of inputs coming into the neuron (also known as the “fan-in”).
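The effect of the factor of 2 can be seen in a small NumPy sketch (again, not the library code): with Kaiming scaling, the mean square of a ReLU layer's outputs stays close to that of its unit-variance inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, n = 512, 512, 10_000

x = rng.standard_normal((n, fan_in))                                # unit-variance inputs
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)  # Kaiming scaling
h = np.maximum(x @ W, 0.0)                                          # ReLU activations

print(np.mean(x ** 2))  # ≈ 1.0
print(np.mean(h ** 2))  # ≈ 1.0 as well: the signal scale survives the ReLU layer
```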
Fills the input Tensor with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a normal distribution. The resulting tensor will have values sampled from normal distribution with mean=0 and std=sqrt(2 / fan_in).
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (nn.functional name); recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor of shape (fan_in, fan_out), filled with random numbers from the normal distribution according to the Kaiming initialization. |
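A possible call, with the function name `kaiming_normal` and the `mi.init` namespace assumed by analogy with the earlier examples:

```python
import minima as mi

# Weights for a 784 -> 256 layer feeding a ReLU (function name and namespace assumed)
W = mi.init.kaiming_normal(fan_in=784, fan_out=256, nonlinearity='relu')
```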
There is also a version of Kaiming initialization for uniform rather than normal distributions. The resulting weight matrix will have values sampled from a uniform distribution within the range \((-a, a)\), where, as described in the docstring below, \(a = \sqrt{\frac{2}{n_{i}}}\), the same value used as the standard deviation in the normal variant.
Fills the input Tensor with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a uniform distribution. The resulting tensor will have values sampled from uniform distribution in the range [-std, std] where std = sqrt(2 / fan_in).
|  | Type | Default | Details |
|---|---|---|---|
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (nn.functional name); recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| Returns | mi.Tensor |  | A tensor of shape (fan_in, fan_out), filled with random numbers from the uniform distribution according to the Kaiming initialization. |
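And the uniform counterpart, again with the name and namespace assumed by analogy:

```python
import minima as mi

# Function name and namespace assumed, mirroring kaiming_normal above
W = mi.init.kaiming_uniform(fan_in=784, fan_out=256, nonlinearity='relu')
```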
Advantages of Kaiming Initialization:
It helps to keep the variance of the gradients roughly the same across all layers. This ensures that all layers in the network learn at about the same speed, avoiding the saturation of activation functions, and it can also help speed up the convergence of the network.
It performs better with ReLU and its variants because it accounts for the fact that the variance of the output of a neuron with a ReLU activation function is half the variance of its input.
```python
def append_stats(hook, mod, inp, outp):
    # Record the running mean/std of a module's activations on every forward pass
    if not hasattr(hook, 'stats'): hook.stats = ([], [])
    acts = outp  # TODO: move outp to cpu when using an accelerator
    hook.stats[0].append(acts.numpy().mean())
    hook.stats[1].append(acts.numpy().std())

#| export
def _lsuv_stats(hook, mod, inp, outp):
    # Store only the latest activation statistics, used by the LSUV loop below
    acts = outp
    hook.mean = acts.numpy().mean()
    hook.std = acts.numpy().std()
```
```python
class LSUV:
    def __init__(self, model, batch) -> None:
        self.model = model
        self.batch = batch
        # Layers with trainable weights (BatchNorm layers are skipped) and their ReLU activations
        self.params_layers = [m for m in model if hasattr(m, 'weight') and not isinstance(m, mi.nn.BatchNorm1d)]
        self.act_fns = [m for m in model if isinstance(m, mi.nn.ReLU)]
        # Constants
        self.TOLERANCE = 1e-3

    def lsuv_init(self):
        """
        Layer-wise Sequential Unit Variance Initialization (LSUV).
        A method to help neural nets converge faster.

        Args:
            model : the model on which to perform LSUV initialization
            param_module : the module with trainable parameters to which the Hook is to be registered
            activation_module : the activation module to be initialized (ReLU, Sigmoid, etc.)
            input_data : input data to be passed through the model
        """
        for params_layer, acts_layer in zip(self.params_layers, self.act_fns):
            hook = Hook(acts_layer, _lsuv_stats)
            # Re-run the batch and rescale until the activations have ~zero mean and ~unit variance
            while self.model(self.batch) is not None and (
                abs(hook.std - 1) > self.TOLERANCE or abs(hook.mean) > self.TOLERANCE
            ):
                print(f'---> before: {hook.mean} -- {hook.std}')
                if params_layer.bias is not None:
                    params_layer.bias -= mi.Tensor(hook.mean)
                params_layer.weight.data /= mi.Tensor(hook.std)
            print(f'-------------> after: {hook.mean} -- {hook.std}')
            hook.remove()
```
```python
import numpy as np
import minima as mi

# Number of samples
n_samples = 1000
# Number of features (28x28 pixels for a grayscale image)
n_features = 784
# Number of classes
n_classes = 10

# Generate random inputs from a standard normal distribution
X = mi.init.randn(n_samples, n_features)
# Generate random target classes
y = mi.Tensor(np.random.randint(0, n_classes, size=n_samples))
```
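To tie the pieces together, a hypothetical LSUV run on this data could look like the sketch below; mi.nn.ReLU and mi.nn.BatchNorm1d are referenced by the LSUV class above, while mi.nn.Sequential and mi.nn.Linear are assumed names used purely for illustration.

```python
# Hypothetical model definition: Sequential and Linear are assumed module names
model = mi.nn.Sequential(
    mi.nn.Linear(n_features, 256), mi.nn.ReLU(),
    mi.nn.Linear(256, n_classes),
)

# Rescale each weight layer until its ReLU outputs have roughly zero mean and unit variance
LSUV(model, X).lsuv_init()
```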