operators

The operators module in this framework provides a collection of tensor operations for building computational graphs in deep learning. Each class in this module represents a different type of operation that can be performed on tensors, such as element-wise addition, scalar multiplication, division, and exponentiation.

Note about the out_grad parameter

During backpropagation in a neural network, we compute gradients starting from the output layer and propagate them back towards the input layer. The key idea here is that each layer receives the gradient of the loss with respect to its output (let’s call this out_grad), and it needs to compute and pass back the gradient of the loss with respect to its input (let’s call this in_grad). This is needed so that the parameters of each layer can be updated correctly during gradient descent.

The out_grad parameter refers to the gradient of the loss function with respect to the output of the node. Multiplying this with the local gradient gives the gradient of the loss with respect to the input to the node, according to the chain rule of calculus, which is the basis for backpropagation in neural networks.

The chain rule is a fundamental concept in calculus that provides a method to compute the derivative of composite functions. In simple terms, the chain rule states that the derivative of a composite function is the derivative of the outer function multiplied by the derivative of the inner function.

Given a composite function that is the composition of two functions, say, \(f(g(x))\), the chain rule can be stated as follows:

\[\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]

Where:

  • \(\frac{df}{dx}\) is the derivative of the composite function \(f(g(x))\) with respect to \(x\),
  • \(\frac{df}{dg}\) is the derivative of the outer function \(f\) with respect to its argument \(g(x)\), and
  • \(\frac{dg}{dx}\) is the derivative of the inner function \(g(x)\) with respect to \(x\).

The chain rule extends naturally to compositions of more than two functions.
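As a quick numerical sanity check, we can let PyTorch's autograd (which is also used for the worked examples further down this page) apply the chain rule for us. Here the outer function is \(f(u) = u^2\) and the inner one is \(g(x) = 3x\), so \(\frac{df}{dx} = 2g(x) \cdot 3 = 18x\):

import torch

# f(g(x)) with g(x) = 3x and f(u) = u**2, so f(g(x)) = 9x**2 and df/dx = 18x
x = torch.tensor(2.0, requires_grad=True)
g = 3 * x      # inner function
f = g ** 2     # outer function
f.backward()   # autograd applies the chain rule: df/dg * dg/dx = (2 * g) * 3

print(x.grad)  # tensor(36.) -- matches the analytic derivative 18 * x = 36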

Element Wise Addition

Let’s walk through the step-by-step derivative calculation for the EWiseAdd operation:

We have the function f(a, b) = a + b, where a and b are tensors. Our goal is to compute the partial derivatives with respect to a and b.

Step 1: Compute the derivative of f with respect to a, denoted \(\frac{\partial f}{\partial a}\).

\[\frac{\partial f}{\partial a} = \frac{\partial}{\partial a} (a + b)\]

Since a is the variable we are differentiating with respect to and b is treated as a constant, the derivative of a with respect to itself is 1:

\[\frac{\partial f}{\partial a} = 1\]

Step 2: Compute the derivative of f with respect to b.

\[\frac{\partial f}{\partial b} = \frac{\partial}{\partial b} (a + b)\]

Again, since b is the variable we are differentiating with respect to, the derivative of b with respect to itself is 1:

\[\frac{\partial f}{\partial b} = 1\]

Hence, the partial derivatives of f(a, b) = a + b with respect to a and b are both equal to 1.
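In code, this means the backward pass simply routes out_grad through to both inputs unchanged. Here is a minimal sketch of what the EWiseAdd operation can look like; it assumes a TensorOp base class and a node.children attribute holding the op's inputs (the convention used by the Log class shown later on this page), so the actual implementation may differ in its details:

class EWiseAdd(TensorOp):
    def compute(self, a: NDArray, b: NDArray) -> NDArray:
        # forward pass: element-wise sum
        return a + b

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, Tensor]:
        # both local derivatives are 1, so by the chain rule each input
        # receives out_grad * 1 = out_grad
        return (out_grad, out_grad)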


source

add

 add (a:minima.autograd.Tensor, b:minima.autograd.Tensor)

Adds two tensors element-wise.

Args:
  - a: The first tensor.
  - b: The second tensor.

Returns: The element-wise sum of a and b.


source

EWiseAdd

 EWiseAdd ()

Performs element-wise addition of two tensors.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> b = Tensor([4, 5, 6])
    >>> op = EWiseAdd()
    >>> result = op.compute(a, b)
    >>> print(result)
    Tensor([5, 7, 9])

Create two 1-D tensors

a = Tensor([1, 2, 3])
b = Tensor([4, 5, 6])

Create an EWiseAdd operation instance

op = EWiseAdd()

Compute the element-wise sum of a and b

result = op.compute(a, b)
result
minima.Tensor(
[5 7 9])

Alternatively, you can use the add function directly

result = add(a, b)
result
minima.Tensor(
[5 7 9])

Or you can call the operation instance directly:

op(a, b)
minima.Tensor(
[5 7 9])

For 2-D tensors, we can compute the element-wise sum of a and b in the same way

a = Tensor([[1, 2, 3], [4, 5, 6]])
b = Tensor([[7, 8, 9], [10, 11, 12]])

result = op.compute(a, b)
result
minima.Tensor(
[[ 8 10 12]
 [14 16 18]])

Scalar Addition

Explanation for the derivative of the AddScalar operator:

Let’s denote the scalar as c and a as the tensor being added by the scalar. The operation can be described as f(a) = a + c.

The function for the backward pass (i.e., the gradient) is df/da = 1, which means the derivative of f(a) with respect to a is simply 1.

We are given a function \(f(a) = a + c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a + c) \\ &= 1 \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\).

We started by defining the function f(a) = a + c. Differentiating f(a) with respect to a gives 1, so the gradient of f(a) with respect to a is 1, which matches the behavior of the AddScalar operator's gradient method.
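A sketch of that gradient method, under the same assumptions as the EWiseAdd sketch above: since the local derivative is 1 and the scalar is a constant, out_grad passes through unchanged.

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # d(a + c)/da = 1; the constant c receives no gradient
    return (out_grad,)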


source

add_scalar

 add_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])

Adds a scalar to a tensor.

Args:
  - a: The tensor.
  - scalar: The scalar to add.

Returns: The sum of a and the scalar.


source

AddScalar

 AddScalar (scalar:Union[int,float])

Performs addition of a tensor and a scalar.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> op = AddScalar(5)
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([6, 7, 8])

Element Wise Multiplication

Explanation for the derivative of the EWiseMul (element-wise multiplication) operator:

Let’s denote the two input tensors as a and b. The operation can be described as f(a, b) = a * b, where * represents element-wise multiplication.

The function for the backward pass (i.e., the gradient) is df/da = b and df/db = a. This means that the derivative of f(a, b) with respect to a is b, and the derivative with respect to b is a.

We are given a function \(f(a, b) = a \odot b\), where \(a\) and \(b\) are tensors, and \(\odot\) represents element-wise multiplication. Our task is to find the derivatives of this function with respect to \(a\) and \(b\).

By differentiating the function \(f(a, b)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a \odot b) \\ &= b \end{align*}\]

Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(b\).

Similarly, by differentiating the function \(f(a, b)\) with respect to \(b\), we find:

\[\begin{align*} \frac{df}{db} &= \frac{d}{db} (a \odot b) \\ &= a \end{align*}\]

Therefore, the gradient of \(f(a, b)\) with respect to \(b\) is \(a\).
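In code, each input's gradient is therefore out_grad scaled element-wise by the other input. A hedged sketch of the gradient method, again assuming node.children holds the op's inputs:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, Tensor]:
    a, b = node.children
    # chain rule: dL/da = out_grad * b and dL/db = out_grad * a
    return (out_grad * b, out_grad * a)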


source

multiply

 multiply (a:minima.autograd.Tensor, b:minima.autograd.Tensor)

Multiplies two tensors element-wise.

Args:
  - a: The first tensor.
  - b: The second tensor.

Returns: The element-wise product of a and b.


source

EWiseMul

 EWiseMul ()

Performs element-wise multiplication of two tensors.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> b = Tensor([4, 5, 6])
    >>> op = EWiseMul()
    >>> result = op.compute(a, b)
    >>> print(result)
    Tensor([4, 10, 18])

Scalar Multiplication

Let’s denote the scalar as c and a as the tensor being multiplied by the scalar. The operation can be described as f(a) = a * c.

The function for the backward pass (i.e., the gradient) is df/da = c, which means the derivative of f(a) with respect to a is c.

In mathematical terms:

We are given a function \(f(a) = a \cdot c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a \cdot c) \\ &= c \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(c\).

We started by defining the function f(a) = a * c. Differentiating f(a) with respect to a gives c, so the gradient of f(a) with respect to a is c, which matches the behavior of the MulScalar operator's gradient method.
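A sketch of that gradient method, reusing the module's own mul_scalar function rather than assuming operator overloads (the actual implementation may differ):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # d(a * c)/da = c
    return (mul_scalar(out_grad, self.scalar),)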


source

mul_scalar

 mul_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])

Multiplies a tensor by a scalar.

Args:
  - a: The tensor.
  - scalar: The scalar to multiply.

Returns: The product of a and the scalar.


source

MulScalar

 MulScalar (scalar:Union[int,float])

Performs multiplication of a tensor and a scalar.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> op = MulScalar(5)
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([5, 10, 15])

Element Wise Divide

The operation described here is an element-wise division of two tensors, a and b, where the operation can be described as f(a, b) = a / b.

We’ll compute the partial derivatives with respect to a and b:

  1. The partial derivative of f(a, b) with respect to a (df/da) is 1/b.

  2. The partial derivative of f(a, b) with respect to b (df/db) is -a / b^2.

We are given a function \(f(a, b) = \frac{a}{b}\), where \(a\) and \(b\) are tensors. Our task is to find the partial derivatives of this function with respect to \(a\) and \(b\).

Let’s start with \(\frac{\partial f}{\partial a}\):

\[\begin{align*} \frac{\partial f}{\partial a} &= \frac{\partial}{\partial a} \left(\frac{a}{b}\right) \\ &= \frac{1}{b} \end{align*}\]

Now, let’s compute \(\frac{\partial f}{\partial b}\):

\[\begin{align*} \frac{\partial f}{\partial b} &= \frac{\partial}{\partial b} \left(\frac{a}{b}\right) \\ &= - \frac{a}{b^{2}} \end{align*}\]

Here is a detailed derivative:

Given a function of the form \(y = \frac{u}{v}\), where both \(u\) and \(v\) are functions of \(x\), the quotient rule of differentiation states:

\[\frac{dy}{dx} = \frac{v \cdot \frac{du}{dx} - u \cdot \frac{dv}{dx}}{v^2}\]

In our case, we’re looking at the function \(y = \frac{a}{b}\), where \(a\) and \(b\) are tensors. We want to find the derivative with respect to \(b\) (instead of \(x\) in our general formula). So we have:

\[\frac{dy}{db} = \frac{b \cdot \frac{da}{db} - a \cdot \frac{db}{db}}{b^2}\]

Since \(a\) does not depend on \(b\), \(\frac{da}{db} = 0\); and since the derivative of a variable with respect to itself is 1, \(\frac{db}{db} = 1\).

So the derivative \(\frac{dy}{db}\) simplifies to:

\[\frac{dy}{db} = \frac{b \cdot 0 - a \cdot 1}{b^2}\]

Therefore, the derivative of \(y\) with respect to \(b\) is \(-\frac{a}{b^2}\).

Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(\frac{1}{b}\), and the gradient of \(f(a, b)\) with respect to \(b\) is \(- \frac{a}{b^{2}}\).
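A hedged sketch of the corresponding gradient method, assuming Tensor supports the element-wise arithmetic operators used here (as the Log class later on this page suggests) and using the module's negate function:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, Tensor]:
    a, b = node.children
    # dL/da = out_grad * (1 / b); dL/db = out_grad * (-a / b^2)
    return (out_grad / b, negate(out_grad * a / (b * b)))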


source

divide

 divide (a:minima.autograd.Tensor, b:minima.autograd.Tensor)

Divides two tensors element-wise.

Args:
  - a (Tensor): The dividend tensor.
  - b (Tensor): The divisor tensor.

Returns: Tensor: The resulting tensor after element-wise division.

Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1, 2, 3]))
    >>> b = Tensor(np.array([4, 5, 6]))
    >>> result = divide(a, b)
    >>> print(result)
    Tensor([0.25, 0.4, 0.5])


source

EWiseDiv

 EWiseDiv ()

The EWiseDiv operation divides two tensors element-wise.

Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1, 2, 3]))
    >>> b = Tensor(np.array([4, 5, 6]))
    >>> div = EWiseDiv()
    >>> result = div.compute(a.data, b.data)
    >>> print(result)
    array([0.25, 0.4, 0.5])

Scalar Division

Let’s denote the scalar as c, and a as the tensor being divided by the scalar. The operation can be described as f(a) = a / c.

The function for the backward pass (i.e., the gradient) is df/da = 1/c.

This is the derivative of f(a) with respect to a.

We are given a function \(f(a) = \frac{a}{c}\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

Since \(c\) is a constant, we can rewrite \(f(a)\) as \(f(a) = c^{-1}a\) and apply the constant-multiple rule of differentiation (the derivative of \(k \cdot g(a)\) is \(k \cdot g'(a)\)).

Now, we can differentiate this with respect to \(a\):

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (c^{-1}a) \\ &= c^{-1} \frac{d}{da} (a) \\ &= c^{-1} \\ &= \frac{1}{c} \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\frac{1}{c}\).
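A sketch of the gradient method, using the module's divide_scalar function (a sketch under the same assumptions as the earlier ones, not the definitive implementation):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # d(a / c)/da = 1 / c
    return (divide_scalar(out_grad, self.scalar),)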


source

divide_scalar

 divide_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])

Divides a tensor by a scalar.

Args:
  - a (Tensor): The tensor to divide.
  - scalar (int, float): The scalar to divide the tensor by.

Returns: Tensor: The resulting tensor after division.

Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1, 2, 3]))
    >>> scalar = 2
    >>> result = divide_scalar(a, scalar)
    >>> print(result)
    Tensor([0.5, 1.0, 1.5])


source

DivScalar

 DivScalar (scalar:Union[int,float])

The DivScalar operation divides a tensor by a scalar.

Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1, 2, 3]))
    >>> scalar = 2
    >>> div_scalar = DivScalar(scalar)
    >>> result = div_scalar.compute(a.data)
    >>> print(result)
    array([0.5, 1.0, 1.5])

Negation

Let’s denote a as the tensor being negated. The operation can be described as f(a) = -a.

The function for the backward pass (i.e., the gradient) is df/da = -1.

We are given a function \(f(a) = -a\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (-a) \\ &= -1 \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(-1\).
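A sketch of the gradient method, using the module's negate function:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # d(-a)/da = -1, so the gradient is the negated out_grad
    return (negate(out_grad),)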


source

negate

 negate (a:minima.autograd.Tensor)

Negates the given tensor.

Args:
  - a: The tensor to negate.

Returns: The negation of a.

Example:
    >>> a = Tensor([1, -2, 3])
    >>> result = negate(a)
    >>> print(result)
    Tensor([-1, 2, -3])


source

Negate

 Negate ()

Negates the given tensor.

Example:
    >>> a = Tensor([1, -2, 3])
    >>> op = Negate()
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([-1, 2, -3])

Exp

Explanation for the derivative of the Exp operator:

Let’s denote a as the tensor on which the exponential function is applied. The operation can be described as f(a) = exp(a), where exp represents the exponential function.

The function for the backward pass (i.e., the gradient) is df/da = exp(a).

We are given a function \(f(a) = \exp(a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (\exp(a)) \\ &= \exp(a) \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\exp(a)\).
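A sketch of the gradient method: the local derivative is the exponential itself, so we recompute exp(a) from the saved input (assuming node.children holds it, as in the Log class later on):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    a = node.children[0]
    # d(exp(a))/da = exp(a)
    return (out_grad * exp(a),)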


source

exp

 exp (a:minima.autograd.Tensor)

Calculates the exponential of the given tensor.

Args:
  - a: The tensor.

Returns: The exponential of a.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> result = exp(a)
    >>> print(result)
    Tensor([2.71828183, 7.3890561, 20.08553692])


source

Exp

 Exp ()

Calculates the exponential of the given tensor.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> op = Exp()
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([2.71828183, 7.3890561, 20.08553692])

ReLU

The derivative of the ReLU (Rectified Linear Unit) operator:

Let’s denote a as the tensor on which the ReLU function is applied. The ReLU function is defined as follows:

\[ f(a) = \begin{cases} a, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]

The function for the backward pass (i.e., the gradient) is df/da = 1 if a >= 0, and df/da = 0 if a < 0.

We are given a function \(f(a) = \max(0, a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).

By considering the definition of the ReLU function, we can write \(f(a)\) as:

\[ f(a) = \begin{cases} a, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]

Now, let’s differentiate \(f(a)\) with respect to \(a\):

\[ \frac{df}{da} = \begin{cases} 1, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\) if \(a \geq 0\), and \(0\) if \(a < 0\). (Strictly speaking, the derivative at \(a = 0\) is undefined; treating it as \(1\) there is a convention, and some implementations use \(0\) instead.)
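A hedged sketch of the gradient method. How the 0/1 mask is built depends on the Tensor API; here we assume a.data exposes the underlying array (as the examples on this page suggest) and that comparing it with 0 produces a mask we can wrap back into a Tensor:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    a = node.children[0]
    # 1 where the input was non-negative, 0 elsewhere
    return (out_grad * Tensor(a.data >= 0),)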


source

relu

 relu (a:minima.autograd.Tensor)

Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.

Args:
  - a: The tensor.

Returns: The result of applying ReLU to a.

Example:
    >>> a = Tensor([1, -2, 3])
    >>> result = relu(a)
    >>> print(result)
    Tensor([1, 0, 3])


source

ReLU

 ReLU ()

Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.

Example:
    >>> a = Tensor([1, -2, 3])
    >>> op = ReLU()
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([1, 0, 3])

Power Scalar

The derivative of the PowerScalar operator:

Let’s denote the scalar as n and a as the tensor being raised to the power of the scalar. The operation can be described as f(a) = a^n.

The function for the backward pass (i.e., the gradient) is df/da = n * a^(n-1).

We are given a function \(f(a) = a^n\), where \(a\) is a tensor and \(n\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a^n) \\ &= n \cdot a^{n-1} \end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(n \cdot a^{n-1}\).
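A sketch of the gradient method, built from the module's own power_scalar and mul_scalar functions:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    a = node.children[0]
    # d(a^n)/da = n * a^(n-1)
    return (out_grad * mul_scalar(power_scalar(a, self.scalar - 1), self.scalar),)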


source

power_scalar

 power_scalar (a:minima.autograd.Tensor, scalar:int)

Raises a tensor to a power.

Args:
  - a (Tensor): The input tensor.
  - scalar (int): The power to raise the tensor to.

Returns: Tensor: The resulting tensor after the power operation.

Example:
    >>> import numpy as np
    >>> tensor = Tensor(np.array([1, 2, 3]))
    >>> result = power_scalar(tensor, 2)
    >>> print(result)
    Tensor([1, 4, 9])


source

PowerScalar

 PowerScalar (scalar:int)

The PowerScalar operation raises a tensor to an (integer) power.

Attributes:
  - scalar (int): The power to raise the tensor to.

Example:
    >>> import numpy as np
    >>> tensor = Tensor(np.array([1, 2, 3]))
    >>> pow_scalar = PowerScalar(2)
    >>> result = pow_scalar.compute(tensor.data)
    >>> print(result)
    array([1, 4, 9])

Log

Explanation for the derivative of the Log operator:

Let’s denote a as the tensor on which the logarithm is applied. The operation can be described as f(a) = log(a), where log represents the natural logarithm.

The function for the backward pass (i.e., the gradient) is df/da = 1/a.

We are given a function \(f(a) = \log(a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (\log(a)) \\ &= \frac{1}{a} \end{align*}\]

We started by defining the function f(a) = log(a), where log is the natural logarithm. Differentiating f(a) with respect to a gives 1/a, so the gradient of f(a) with respect to a is 1/a, which is exactly the behavior of the Log operator's gradient method below.

class Log(TensorOp):
    """
    The Log operation applies the natural logarithm element-wise on the tensor.

    Example:
        >>> import numpy as np
        >>> a = Tensor(np.array([1.0, 2.0, 3.0]))
        >>> log_op = Log()
        >>> result = log_op.compute(a.data)
        >>> print(result)
        array([0., 0.69314718, 1.09861229])
    """

    def compute(self, a: NDArray) -> NDArray:
        """
        Applies the natural logarithm to the tensor.

        Args:
            a (NDArray): The input tensor.

        Returns:
            NDArray: The resulting tensor after applying the natural logarithm.
        """
        return ARRAY_API.log(a)

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, ...]:
        """
        Computes the gradient of the log operation.

        Args:
            out_grad (Tensor): The gradient of the output tensor.
            node (Tensor): The node in the computational graph where the operation was performed.

        Returns:
            Tuple[Tensor, ...]: The gradient with respect to the input tensor.
        """
        a = node.children[0]
        return (out_grad / a, )

def log(a: Tensor) -> Tensor:
    """
    Applies the natural logarithm to the tensor.

    Args:
        a (Tensor): The input tensor.

    Returns:
        Tensor: The resulting tensor after applying the natural logarithm.

    Example:
        >>> import numpy as np
        >>> a = Tensor(np.array([1.0, 2.0, 3.0]))
        >>> result = log(a)
        >>> print(result)
        Tensor([0., 0.69314718, 1.09861229])
    """
    return Log()(a)

Transpose

The operation described here is transposition. Let's define it as a function f such that f(a) = a^T, where a is a tensor and a^T is the transpose of tensor a.

The goal here is to compute the gradient of this operation with respect to a. It's important to note that the transposition operation doesn't change the values of the tensor's elements; it just rearranges their positions. This implies that the gradient with respect to the input is simply the transposed version of the gradient with respect to the output.

Let's denote the gradient of the loss \(L\) with respect to the transposed output as \(g\); this is exactly the out_grad parameter described at the top of this page.

Given this understanding, we can make an important conclusion:

  1. The gradient of the loss with respect to a is g^T: each gradient entry is moved back to the position of the input element it corresponds to.

In mathematical terms:

\[\frac{\partial L}{\partial a} = \left(\frac{\partial L}{\partial a^T}\right)^T = g^T\]

This equation indicates that the gradient of the transposition operation is the transpose of the output gradient.

In the Python class Transpose that implements this operation, the gradient method computes exactly this: transpose(out_grad, axes=self.axes) transposes out_grad along the same axes that were used in the forward pass. Thus, the gradient of the transposition operation with respect to the input tensor a is the transpose of the output gradient out_grad.
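Concretely, the gradient method can be as small as this sketch (mirroring the description above; the actual implementation may differ in details):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # transposing along the same axes moves every gradient entry back
    # to the position of the input element it corresponds to
    return (transpose(out_grad, axes=self.axes),)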


source

transpose

 transpose (a:minima.autograd.Tensor, axes:Optional[tuple]=None)

Perform the transpose operation on the input tensor along the specified axes. If no axes are specified, it swaps the last two dimensions of the input tensor.

Args:
  - a (Tensor): The input tensor.
  - axes (Optional[tuple]): The pair of axes that should be swapped. If not provided, the last two axes are swapped.

Returns: Tensor: The transposed tensor.

Example:
    >>> a = Tensor(np.arange(1, 7).reshape(2, 3))
    >>> result = transpose(a)
    >>> print(result)
    Tensor([[1, 4], [2, 5], [3, 6]])


source

Transpose

 Transpose (axes:Optional[tuple]=None)

Tensor operation class that performs transposition of a tensor along specified axes.

If no axes are specified, it swaps the last two dimensions of the input tensor.

Example:
    >>> a = Tensor(np.arange(1, 7).reshape(2, 3))
    >>> op = Transpose()
    >>> result = op.compute(a.data)
    >>> print(result)
    array([[1, 4], [2, 5], [3, 6]])

Reshape

The operation described here is a reshaping of a tensor a, where the operation can be described as f(a) = reshape(a, new_shape).

We’ll compute the derivative of this operation.

The reshaping operation doesn’t change the values of the tensor elements but only rearranges them. This means that the gradient of a reshaped tensor is just the reshaped gradient of the original tensor.

Let's denote the gradient of the loss with respect to the reshaped output as g; this is the out_grad parameter.

Given this, we can derive the following:

  1. The gradient of the loss with respect to a is reshape(g, original_shape).

In mathematical terms:

We are given a function \(f(a) = \text{reshape}(a, \text{new\_shape})\), where \(a\) is a tensor. Since reshaping only rearranges elements, the gradient with respect to \(a\) is the output gradient rearranged back into the input's shape:

\[\frac{\partial L}{\partial a} = \text{reshape}(g, \text{original\_shape})\]

Here, \(g\) is the gradient of the loss with respect to the reshaped output. The reshaped gradient has the same shape as the original tensor.

Now, let’s apply this to the Reshape class.

The gradient method in the Reshape class computes the gradient of the reshape operation. The gradient of the reshaped tensor is just the reshaped gradient of the original tensor. This is implemented by applying the reshape function to out_grad, which is the gradient of the output tensor, and then returning this reshaped gradient. The shape used for the reshaping is the shape of the original tensor, which is obtained from node.children[0].shape.

Therefore, the gradient of the reshape operation with respect to the input tensor a is the reshaping of the output gradient out_grad to the shape of the original tensor.
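Concretely, the gradient method can look like this sketch:

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    # reshape the output gradient back to the input's original shape
    return (reshape(out_grad, node.children[0].shape),)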


source

reshape

 reshape (a:minima.autograd.Tensor, shape:Tuple[int,...])

Reshape the input tensor to the specified shape.

Args:
  - a (Tensor): The input tensor.
  - shape (Tuple[int, ...]): The desired shape of the output tensor.

Returns: Tensor: The reshaped tensor.

Example:
    >>> a = Tensor([1, 2, 3, 4, 5, 6])
    >>> result = reshape(a, (2, 3))
    >>> print(result)
    Tensor([[1, 2, 3], [4, 5, 6]])


source

Reshape

 Reshape (shape:Tuple[int,...])

Tensor operation class that reshapes a tensor.

Example:
    >>> a = Tensor([1, 2, 3, 4, 5, 6])
    >>> op = Reshape((2, 3))
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([[1, 2, 3], [4, 5, 6]])

Matrix Multiplication

Matrix multiplication, often denoted "matmul" in programming languages, is the operation of multiplying two matrices together.

When a function's inputs are matrices, instead of a single derivative we generally speak of the Jacobian, a matrix of partial derivatives. If a function takes a matrix as input and produces a scalar output, its gradient is a matrix of the same shape as the input.

In the context of deep learning and backpropagation, what we need is the derivative of the matrix multiplication operation with respect to each of its inputs, because these gradients are required to update the weights when training a neural network.

Let’s denote the matrices as A and B, where A is a matrix of dimension m x n and B is a matrix of dimension n x p, and the result of the multiplication C = A * B is a matrix of dimension m x p.

If we are to compute the derivative of C with respect to A (i.e., ∂C/∂A), each element in A affects all elements in its corresponding row in C.

Similarly, if we are to compute the derivative of C with respect to B (i.e., ∂C/∂B), each element in B affects all elements in its corresponding column in C.

In actual computation, if we have a scalar-valued loss function \(L\), we would compute the gradient of \(L\) with respect to \(A\) (denoted \(\frac{\partial L}{\partial A}\)), which has the same shape as \(A\). To compute this, we need the gradient of \(L\) with respect to \(C\) (denoted \(\frac{\partial L}{\partial C}\)); then:

\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^T\]

Similarly, for the gradient of \(L\) with respect to \(B\):

\[\frac{\partial L}{\partial B} = A^T \cdot \frac{\partial L}{\partial C}\]

(Here \(\cdot\) denotes matrix multiplication and \(B^T\) is the transpose of \(B\).)

The line axes_to_sum_over = tuple(range(len(out_shape) - len(lhs_shape))) is calculating which axes (dimensions) of the output gradient tensor (out_grad) need to be summed over when computing the gradient with respect to the left-hand side (a) input tensor.

This is necessary when the rank (number of dimensions) of out_grad is larger than the rank of a. This can happen, for instance, when a is a matrix (2D tensor) and out_grad is a 3D tensor (which can result from batched matrix multiplication).

The range function generates a sequence of integers from 0 up to (but not including) len(out_shape) - len(lhs_shape). The tuple function then takes this sequence and turns it into a tuple. The result is a tuple of integers representing the axes to sum over.

Here is a concrete example:

Suppose we have a batched matrix multiplication where A (the lhs) is a matrix of shape (m, n), and out_grad is a 3D tensor of shape (b, m, n), where b is the batch size.

In this case, len(out_shape) - len(lhs_shape) equals 1, so range(len(out_shape) - len(lhs_shape)) generates the sequence of integers from 0 to 1 (not inclusive), which is just [0].

So axes_to_sum_over will be (0,), indicating that we need to sum over the first axis (the batch axis) of out_grad when computing the gradient with respect to A.

This summing operation effectively accumulates the individual gradients for each item in the batch into a single gradient for the A matrix.

# Suppose we have the following shapes for `lhs` and `out_grad`
m, n, b = 5, 7, 3

# Let's create some tensors with these shapes
A = torch.randn(m, n)          # lhs is a 2D tensor (matrix) of shape (m, n)
out_grad = torch.randn(b, m, n)  # out_grad is a 3D tensor of shape (b, m, n)

# Let's say `rhs` is another matrix that was involved in computing out_grad
B = torch.randn(n, m)
out_shape, A_shape, B_shape = out_grad.shape, A.shape, B.shape
out_shape, A_shape, B_shape
(torch.Size([3, 5, 7]), torch.Size([5, 7]), torch.Size([7, 5]))
len(out_shape), len(A_shape)
(3, 2)
rng = range(len(out_shape) - len(A_shape))
rng
range(0, 1)
tuple(rng)
(0,)
axes_to_sum_over = tuple(range(len(out_shape) - len(A_shape)))
axes_to_sum_over
(0,)
torch.sum(out_grad @ B, axes_to_sum_over)
tensor([[-0.1309, -1.9203, -4.4179,  2.8422, -0.4453],
        [-1.5883, -8.1020, -6.7316, -1.3045,  0.6170],
        [-0.5317,  2.3444,  1.6038, -3.5786, -0.1689],
        [ 1.0831, -1.3743,  0.8485, -3.0593,  2.2023],
        [ 0.3071,  1.8321, -3.6827, -9.4409, -1.1884]])
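Putting the pieces together, here is a hedged sketch of the whole MatMul operation, built from the module's own matmul, transpose, and summation functions (the actual implementation may differ):

class MatMul(TensorOp):
    def compute(self, a: NDArray, b: NDArray) -> NDArray:
        return a @ b

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, Tensor]:
        a, b = node.children
        # dL/dA = dL/dC @ B^T and dL/dB = A^T @ dL/dC
        grad_a = matmul(out_grad, transpose(b))
        grad_b = matmul(transpose(a), out_grad)
        # if broadcasting added leading batch axes, sum them away so each
        # gradient has the same shape as its input
        if len(grad_a.shape) > len(a.shape):
            grad_a = summation(grad_a, tuple(range(len(grad_a.shape) - len(a.shape))))
        if len(grad_b.shape) > len(b.shape):
            grad_b = summation(grad_b, tuple(range(len(grad_b.shape) - len(b.shape))))
        return (grad_a, grad_b)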

source

matmul

 matmul (a:minima.autograd.Tensor, b:minima.autograd.Tensor)

Perform matrix multiplication on two tensors.

Args:
  - a (Tensor): The first input tensor.
  - b (Tensor): The second input tensor.

Returns: Tensor: The product of a and b.

Example:
    >>> a = Tensor([[1, 2], [3, 4]])
    >>> b = Tensor([[5, 6], [7, 8]])
    >>> result = matmul(a, b)
    >>> print(result)
    Tensor([[19, 22], [43, 50]])


source

MatMul

 MatMul ()

Tensor operation class that performs matrix multiplication.

Example:
    >>> a = Tensor([[1, 2], [3, 4]])
    >>> b = Tensor([[5, 6], [7, 8]])
    >>> op = MatMul()
    >>> result = op.compute(a, b)
    >>> print(result)
    Tensor([[19, 22], [43, 50]])

Summation

The Summation operation, when provided with the axes argument, sums over these axes and thereby reduces the rank of the tensor by the number of axes summed over. The backward pass needs to take this into account, as it needs to return a gradient tensor of the same shape as the input.

The forward pass (compute method) is straightforward - it just computes the sum over the specified axes.

In the backward pass (gradient method), the goal is to compute the gradient of the sum operation. Since every element of the input tensor contributes equally to the sum, the derivative of the sum with respect to each element is 1. However, since the sum operation may reduce the dimensionality of the tensor (when axes is not None), we need to account for this when computing the gradient.

To do this, we first create a new shape, where the dimensions specified by axes are replaced by 1. We then reshape out_grad to this new shape. This essentially “undoes” the dimensionality reduction performed by the sum operation. Finally, we use broadcast_to to make the reshaped gradient tensor the same shape as the input tensor.

Suppose you have the following tensor in PyTorch:

# 3x3 tensor
x = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]], requires_grad=True)

# Sum over axis 0
y = x.sum(axis=0)
y
tensor([12., 15., 18.], grad_fn=<SumBackward1>)

y is now a 1-dimensional tensor of shape (3,), because we’ve summed over axis 0. If we compute the gradient of y with respect to x, we’ll want the resulting gradient tensor to have the same shape as x, which is (3,3). However, the gradient tensor we receive during backpropagation (out_grad) will have the same shape as y, which is (3,).

So we need to “undo” the dimensionality reduction by reshaping and broadcasting out_grad to match the shape of x. Here’s how you can do it in PyTorch:

# Mock out_grad tensor
out_grad = torch.tensor([1., 1., 1.])

# Reshape out_grad to put a size-1 dimension back where axis 0 was summed out
reshaped_grad = out_grad.reshape(1, 3)

# Broadcast the reshaped_grad to match the input shape
broadcasted_grad = reshaped_grad.expand_as(x)

print(broadcasted_grad)
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])

Now broadcasted_grad has the same shape as x, so it can be correctly used as the gradient of x in further computations. This manual operation simulates what the gradient method of the Summation operation is doing.
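Here is a hedged sketch of what Summation.gradient can look like, using the module's reshape and broadcast_to functions (a sketch under the same API assumptions as the earlier ones):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    a_shape = node.children[0].shape
    # rebuild the input's shape with 1s in place of the summed-over axes
    if self.axes is None:
        new_shape = [1] * len(a_shape)
    else:
        new_shape = list(a_shape)
        for axis in self.axes:
            new_shape[axis] = 1
    # "undo" the reduction, then broadcast back to the input's full shape
    return (broadcast_to(reshape(out_grad, new_shape), a_shape),)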


source

summation

 summation (a:minima.autograd.Tensor, axes:Optional[tuple]=None)

Computes the sum of a along the specified axes.

Args:
  - a: The input tensor.
  - axes (tuple, optional): The dimensions to reduce. If None (default), reduces all dimensions.

Returns: The sum of a along the specified axes.


source

Summation

 Summation (axes:Optional[tuple]=None)

Op to compute the sum of a tensor along specified axes.

Example:
    >>> a = Tensor([[1, 2, 3], [4, 5, 6]])
    >>> op = Summation(axes=(0,))
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([5, 7, 9])

Args:
  - axes (tuple, optional): The dimensions to reduce. If None (default), reduces all dimensions.

Methods:
  - compute(a: NDArray) -> NDArray: Computes the sum of a along the specified axes.
  - gradient(out_grad: Tensor, node: Tensor) -> Tuple[Tensor]: Computes the gradient of the sum operation.

Broadcast

# First, we create a tensor a, and set requires_grad = True so that we can compute gradients with respect to it
a = torch.tensor([1., 2., 3.], requires_grad=True)

# Now, let's define a function that performs the broadcasting operation
def broadcast_to(input, shape):
    return input.expand(shape)

# We broadcast a to a larger shape
shape = (3, 3)
b = broadcast_to(a, shape)
b
tensor([[1., 2., 3.],
        [1., 2., 3.],
        [1., 2., 3.]], grad_fn=<ExpandBackward0>)
b.shape
torch.Size([3, 3])
# Then, we define an output tensor as the sum of elements in b
# This is a simple function that we can differentiate, and will result in a gradient for b
out = b.sum()

# Compute gradients
out.backward()
a.grad
tensor([3., 3., 3.])
# Define the output gradient tensor
out_grad = torch.tensor([[1., 2., 3.], [1., 2., 3.], [1., 2., 3.]])
out_grad.shape
torch.Size([3, 3])
a_shape = a.shape
a_shape
torch.Size([3])
# Pad a's shape with leading 1s so it lines up with the broadcast shape (3, 3)
shape = [1] * (len((3, 3)) - len(a_shape)) + list(a_shape)
shape
[1, 3]
# The gradient for the broadcast operation is the sum of out_grad over the dimension that was broadcasted
grad_a = out_grad.sum(dim=0)

print(grad_a)
tensor([3., 6., 9.])

source

broadcast_to

 broadcast_to (a:minima.autograd.Tensor, shape:Tuple[int,...])

Broadcasts a to the specified shape.

Args:
  - a: The input tensor.
  - shape: The new shape to broadcast the input tensor to.

Returns: The tensor a broadcasted to the specified shape.


source

BroadcastTo

 BroadcastTo (shape)

Op to broadcast a tensor to a new shape.

Example:
    >>> a = Tensor([1, 2, 3])
    >>> op = BroadcastTo((3, 3))
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3]])

Args:
  - shape (tuple): The new shape to broadcast the input tensor to.

Methods:
  - compute(a: NDArray) -> NDArray: Broadcasts a to the specified shape.
  - gradient(out_grad: Tensor, node: Tensor) -> Tuple[Tensor]: Computes the gradient of the broadcast operation.

br = BroadcastTo((5,2,3))
a = Tensor([[1., 2., 3.], [1., 2., 3.]])
a.shape
(2, 3)
a_br = br.compute(a)
a_br.shape
(5, 2, 3)
out_grad = Tensor(numpy.ones_like(a_br))
out_grad.shape
(5, 2, 3)
out_grad
minima.Tensor(
[[[1 1 1]
  [1 1 1]]

 [[1 1 1]
  [1 1 1]]

 [[1 1 1]
  [1 1 1]]

 [[1 1 1]
  [1 1 1]]

 [[1 1 1]
  [1 1 1]]])
a_shape = a.shape
a_shape
(2, 3)
shape = [1] * (len(br.shape) - len(a_shape)) + list(a_shape)
br.shape, shape
((5, 2, 3), [1, 2, 3])
sum_over = tuple([idx for idx in range(len(br.shape)) if br.shape[idx] != shape[idx]])
sum_over
(0,)
reshape(summation(out_grad, sum_over), a_shape).shape
(2, 3)
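The walkthrough above is exactly the logic of the backward pass. Collected into a hedged sketch of BroadcastTo.gradient (self.shape is the target shape passed to the constructor, as br.shape above suggests):

def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
    a_shape = node.children[0].shape
    # left-pad the input shape with 1s so it lines up with self.shape
    padded = [1] * (len(self.shape) - len(a_shape)) + list(a_shape)
    # sum over every axis that the broadcast actually expanded
    sum_over = tuple(i for i in range(len(self.shape)) if self.shape[i] != padded[i])
    return (reshape(summation(out_grad, sum_over), a_shape),)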

LogSumExp


source

logsumexp

 logsumexp (a, axes=None)

source

LogSumExp

 LogSumExp (axes:Optional[tuple]=None)

A Tensor operation class for performing LogSumExp computation.
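This page does not walk through LogSumExp, so briefly: \(\text{logsumexp}(a) = \log \sum_j e^{a_j}\) over the given axes, conventionally computed with the maximum subtracted first so the exponentials cannot overflow, and its gradient is the softmax of the input scaled by out_grad, since \(\frac{\partial}{\partial a_i} \log \sum_j e^{a_j} = \frac{e^{a_i}}{\sum_j e^{a_j}}\). A hedged sketch under the same API assumptions as the earlier sketches (element-wise subtraction on Tensor is assumed to exist; the actual implementation may differ):

class LogSumExp(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a: NDArray) -> NDArray:
        # log(sum(exp(a))) = max(a) + log(sum(exp(a - max(a))))
        max_a = ARRAY_API.max(a, axis=self.axes, keepdims=True)
        return ARRAY_API.log(ARRAY_API.sum(ARRAY_API.exp(a - max_a), axis=self.axes)) + ARRAY_API.max(a, axis=self.axes)

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor]:
        a = node.children[0]
        # restore the reduced axes as size-1 dims, as in Summation.gradient
        new_shape = list(a.shape)
        for axis in (self.axes if self.axes is not None else range(len(a.shape))):
            new_shape[axis] = 1
        # softmax(a) = exp(a - logsumexp(a)); node holds the forward output
        softmax = exp(a - broadcast_to(reshape(node, new_shape), a.shape))
        return (broadcast_to(reshape(out_grad, new_shape), a.shape) * softmax,)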

Export

import nbdev; nbdev.nbdev_export()