operators
The `operators` module in this framework provides a collection of tensor operations for building computational graphs in deep learning. Each class in this module represents a different type of operation that can be performed on tensors, such as element-wise addition, scalar multiplication, division, and exponentiation.
Note about the `out_grad` parameter
During backpropagation in a neural network, we compute gradients starting from the output layer and propagate them back towards the input layer. The key idea here is that each layer receives the gradient of the loss with respect to its output (let’s call this `out_grad`), and it needs to compute and pass back the gradient of the loss with respect to its input (let’s call this `in_grad`). This is needed so that the parameters of each layer can be updated correctly during gradient descent.
The `out_grad` parameter refers to the gradient of the loss function with respect to the output of the node. Multiplying this with the local gradient gives the gradient of the loss with respect to the input to the node, according to the chain rule of calculus, which is the basis for backpropagation in neural networks.
The chain rule is a fundamental concept in calculus that provides a method to compute the derivative of composite functions. In simple terms, the chain rule states that the derivative of a composite function is the derivative of the outer function multiplied by the derivative of the inner function.
Given a composite function that is the composition of two functions, say, \(f(g(x))\), the chain rule can be stated as follows:
\[\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}\]
Where:
- \(\frac{df}{dx}\) is the derivative of the composite function \(f(g(x))\) with respect to \(x\),
- \(\frac{df}{dg}\) is the derivative of the outer function \(f\) with respect to its argument \(g(x)\), and
- \(\frac{dg}{dx}\) is the derivative of the inner function \(g(x)\) with respect to \(x\).
The chain rule can be extended to the case where we have more than two composite functions.
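As a quick sanity check of the chain rule, the sketch below compares the analytic derivative of a composite function with a central finite-difference estimate. The particular choice \(f(u) = \sin(u)\), \(g(x) = x^2\) and all helper names are illustrative assumptions, not part of the framework.

```python
import math


def g(x):
    # Inner function g(x) = x^2
    return x ** 2


def f(u):
    # Outer function f(u) = sin(u)
    return math.sin(u)


def chain_rule_derivative(x):
    # df/dx = df/dg * dg/dx = cos(x^2) * 2x
    return math.cos(g(x)) * (2 * x)


def numerical_derivative(fn, x, eps=1e-6):
    # Central finite difference approximation of fn'(x)
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


x = 0.7
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)  # the two values should agree closely
```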
Element Wise Addition
Let’s walk through the step-by-step derivative calculation for the `EWiseAdd` operation:
We have the function `f(a, b) = a + b`, where `a` and `b` are tensors. Our goal is to compute the partial derivatives with respect to `a` and `b`.
Let’s start by calculating the derivative of `f` with respect to `a`, denoted as `df/da`:
Step 1: Compute the derivative of `f` with respect to `a`.
\[\frac{\partial f}{\partial a} = \frac{\partial}{\partial a} (a + b)\]
Since `a` is the variable we are differentiating with respect to, the derivative of `a` with respect to itself is 1, while `b` is treated as a constant and contributes 0:
\[\frac{\partial f}{\partial a} = 1\]
Step 2: Compute the derivative of `f` with respect to `b`.
\[\frac{\partial f}{\partial b} = \frac{\partial}{\partial b} (a + b)\]
Again, since `b` is the variable we are differentiating with respect to, the derivative of `b` with respect to itself is 1, while `a` is treated as a constant and contributes 0:
\[\frac{\partial f}{\partial b} = 1\]
Hence, the partial derivatives of `f(a, b) = a + b` with respect to `a` and `b` are both equal to 1.
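In terms of the backward pass, a local derivative of 1 means that `out_grad` flows to both inputs unchanged. The following is a minimal NumPy sketch of that idea. `ewise_add_backward` is a hypothetical helper, not the framework's `EWiseAdd.gradient` method, and it assumes both inputs already have the same shape (no broadcasting).

```python
import numpy as np


def ewise_add_backward(out_grad):
    # Both local derivatives are 1, so the chain rule gives
    # grad_a = out_grad * 1 and grad_b = out_grad * 1.
    return out_grad * 1.0, out_grad * 1.0


out_grad = np.array([0.5, -1.0, 2.0])
grad_a, grad_b = ewise_add_backward(out_grad)
print(grad_a, grad_b)
```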
add
add (a:minima.autograd.Tensor, b:minima.autograd.Tensor)
Adds two tensors element-wise.
Args:
- a: The first tensor.
- b: The second tensor.
Returns: The element-wise sum of a and b.
EWiseAdd
EWiseAdd ()
Performs element-wise addition of two tensors.
Example:
>>> a = Tensor([1, 2, 3])
>>> b = Tensor([4, 5, 6])
>>> op = EWiseAdd()
>>> result = op.compute(a, b)
>>> print(result)
Tensor([5, 7, 9])
# Create two 1-D tensors
a = Tensor([1, 2, 3])
b = Tensor([4, 5, 6])
# Create an EWiseAdd operation instance
op = EWiseAdd()
# Compute the element-wise sum of a and b
result = op.compute(a, b)
result
minima.Tensor(
[5 7 9])
# Alternatively, you can use the add function directly
result = add(a, b)
result
minima.Tensor(
[5 7 9])
# or call the operation instance directly
op(a, b)
minima.Tensor(
[5 7 9])
# For 2-D tensors, we can compute the element-wise sum of a and b in the same way
a = Tensor([[1, 2, 3], [4, 5, 6]])
b = Tensor([[7, 8, 9], [10, 11, 12]])
result = op.compute(a, b)
result
minima.Tensor(
[[ 8 10 12]
 [14 16 18]])
Scalar Addition
Explanation for the derivative of the `AddScalar` operator:
Let’s denote the scalar as `c` and `a` as the tensor being added by the scalar. The operation can be described as `f(a) = a + c`.
The function for the backward pass (i.e., the gradient) is `df/da = 1`, which means the derivative of `f(a)` with respect to `a` is simply `1`.
We are given a function \(f(a) = a + c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a + c) \\ &= 1 \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\).
In other words, the gradient of `f(a) = a + c` with respect to `a` is `1`, which matches the behavior of the `AddScalar` operator as provided in its `gradient` method.
add_scalar
add_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])
Adds a scalar to a tensor.
Args:
- a: The tensor.
- scalar: The scalar to add.
Returns: The sum of a and the scalar.
AddScalar
AddScalar (scalar:Union[int,float])
Performs addition of a tensor and a scalar.
Example:
>>> a = Tensor([1, 2, 3])
>>> op = AddScalar(5)
>>> result = op.compute(a)
>>> print(result)
Tensor([6, 7, 8])
Element Wise Multiplication
Explanation for the derivative of the `EWiseMul` (element-wise multiplication) operator:
Let’s denote the two input tensors as `a` and `b`. The operation can be described as `f(a, b) = a * b`, where `*` represents element-wise multiplication.
The function for the backward pass (i.e., the gradient) is `df/da = b` and `df/db = a`. This means that the derivative of `f(a, b)` with respect to `a` is `b`, and the derivative with respect to `b` is `a`.
We are given a function \(f(a, b) = a \odot b\), where \(a\) and \(b\) are tensors, and \(\odot\) represents element-wise multiplication. Our task is to find the derivatives of this function with respect to \(a\) and \(b\).
By differentiating the function \(f(a, b)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a \odot b) \\ &= b \end{align*}\]
Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(b\).
Similarly, by differentiating the function \(f(a, b)\) with respect to \(b\), we find:
\[\begin{align*} \frac{df}{db} &= \frac{d}{db} (a \odot b) \\ &= a \end{align*}\]
Therefore, the gradient of \(f(a, b)\) with respect to \(b\) is \(a\).
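Combined with the chain rule, these local derivatives mean the backward pass scales `out_grad` by the other operand. A minimal NumPy sketch of this follows. `ewise_mul_backward` is a hypothetical stand-in for illustration rather than the framework's `EWiseMul.gradient`, and it assumes `a`, `b`, and `out_grad` share one shape.

```python
import numpy as np


def ewise_mul_backward(out_grad, a, b):
    # Chain rule: grad_a = out_grad * df/da = out_grad * b
    #             grad_b = out_grad * df/db = out_grad * a
    return out_grad * b, out_grad * a


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
out_grad = np.ones_like(a)  # gradient of the loss w.r.t. the output

grad_a, grad_b = ewise_mul_backward(out_grad, a, b)
print(grad_a)  # [4. 5. 6.]
print(grad_b)  # [1. 2. 3.]
```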
multiply
multiply (a:minima.autograd.Tensor, b:minima.autograd.Tensor)
Multiplies two tensors element-wise.
Args:
- a: The first tensor.
- b: The second tensor.
Returns: The element-wise product of a and b.
EWiseMul
EWiseMul ()
Performs element-wise multiplication of two tensors.
Example:
>>> a = Tensor([1, 2, 3])
>>> b = Tensor([4, 5, 6])
>>> op = EWiseMul()
>>> result = op.compute(a, b)
>>> print(result)
Tensor([4, 10, 18])
Scalar Multiplication
Let’s denote the scalar as `c` and `a` as the tensor being multiplied by the scalar. The operation can be described as `f(a) = a * c`.
The function for the backward pass (i.e., the gradient) is `df/da = c`, which means the derivative of `f(a)` with respect to `a` is `c`.
We are given a function \(f(a) = a \cdot c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a \cdot c) \\ &= c \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(c\).
In other words, the gradient of `f(a) = a * c` with respect to `a` is `c`, which matches the behavior of the `MulScalar` operator as provided in its `gradient` method.
mul_scalar
mul_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])
Multiplies a tensor by a scalar.
Args:
- a: The tensor.
- scalar: The scalar to multiply.
Returns: The product of a and the scalar.
MulScalar
MulScalar (scalar:Union[int,float])
Performs multiplication of a tensor and a scalar.
Example:
>>> a = Tensor([1, 2, 3])
>>> op = MulScalar(5)
>>> result = op.compute(a)
>>> print(result)
Tensor([5, 10, 15])
Element Wise Divide
The operation described here is an element-wise division of two tensors, `a` and `b`, where the operation can be described as `f(a, b) = a / b`.
We’ll compute the partial derivatives with respect to `a` and `b`:
- The partial derivative of `f(a, b)` with respect to `a` (`df/da`) is `1/b`.
- The partial derivative of `f(a, b)` with respect to `b` (`df/db`) is `-a / b^2`.
We are given a function \(f(a, b) = \frac{a}{b}\), where \(a\) and \(b\) are tensors. Our task is to find the partial derivatives of this function with respect to \(a\) and \(b\).
Let’s start with \(\frac{\partial f}{\partial a}\):
\[\begin{align*} \frac{\partial f}{\partial a} &= \frac{\partial}{\partial a} \left(\frac{a}{b}\right) \\ &= \frac{1}{b} \end{align*}\]
Now, let’s compute \(\frac{\partial f}{\partial b}\):
\[\begin{align*} \frac{\partial f}{\partial b} &= \frac{\partial}{\partial b} \left(\frac{a}{b}\right) \\ &= - \frac{a}{b^{2}} \end{align*}\]
Here is a detailed derivation of the derivative with respect to \(b\):
Given a function of the form \(y = \frac{u}{v}\), where both \(u\) and \(v\) are functions of \(x\), the quotient rule of differentiation states:
\[\frac{dy}{dx} = \frac{v \cdot \frac{du}{dx} - u \cdot \frac{dv}{dx}}{v^2}\]
In our case, we’re looking at the function \(y = \frac{a}{b}\), where \(a\) and \(b\) are tensors. We want to find the derivative with respect to \(b\) (instead of \(x\) in our general formula). So we have:
\[\frac{dy}{db} = \frac{b \cdot \frac{da}{db} - a \cdot \frac{db}{db}}{b^2}\]
Since \(a\) does not depend on \(b\), \(\frac{da}{db} = 0\), and since the derivative of any variable with respect to itself is 1, \(\frac{db}{db} = 1\).
So the derivative \(\frac{dy}{db}\) simplifies to:
\[\frac{dy}{db} = \frac{b \cdot 0 - a \cdot 1}{b^2}\]
Therefore, the derivative of \(y\) with respect to \(b\) is \(-\frac{a}{b^2}\).
Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(\frac{1}{b}\), and the gradient of \(f(a, b)\) with respect to \(b\) is \(- \frac{a}{b^{2}}\).
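The two partial derivatives can be checked numerically. The sketch below is a NumPy illustration under the assumption that `a`, `b`, and `out_grad` have the same shape. `ewise_div_backward` is a hypothetical helper, not the framework's `EWiseDiv.gradient`.

```python
import numpy as np


def ewise_div_backward(out_grad, a, b):
    # Chain rule with df/da = 1/b and df/db = -a / b^2
    grad_a = out_grad / b
    grad_b = -out_grad * a / (b ** 2)
    return grad_a, grad_b


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
out_grad = np.ones_like(a)

grad_a, grad_b = ewise_div_backward(out_grad, a, b)

# Central finite differences, valid here because the operation is element-wise
eps = 1e-6
num_grad_a = ((a + eps) / b - (a - eps) / b) / (2 * eps)
num_grad_b = (a / (b + eps) - a / (b - eps)) / (2 * eps)
print(np.allclose(grad_a, num_grad_a), np.allclose(grad_b, num_grad_b))
```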
divide
divide (a:minima.autograd.Tensor, b:minima.autograd.Tensor)
Divides two tensors element-wise.
Args:
    a (Tensor): The dividend tensor.
    b (Tensor): The divisor tensor.
Returns:
    Tensor: The resulting tensor after element-wise division.
Example:
>>> import numpy as np
>>> a = Tensor(np.array([1, 2, 3]))
>>> b = Tensor(np.array([4, 5, 6]))
>>> result = divide(a, b)
>>> print(result)
Tensor([0.25, 0.4, 0.5])
EWiseDiv
EWiseDiv ()
The EWiseDiv operation divides two tensors element-wise.
Example:
>>> import numpy as np
>>> a = Tensor(np.array([1, 2, 3]))
>>> b = Tensor(np.array([4, 5, 6]))
>>> div = EWiseDiv()
>>> result = div.compute(a.data, b.data)
>>> print(result)
array([0.25, 0.4, 0.5])
Scalar Division
Let’s denote the scalar as `c`, and `a` as the tensor being divided by the scalar. The operation can be described as `f(a) = a / c`.
The function for the backward pass (i.e., the gradient) is `df/da = 1/c`. This is the derivative of `f(a)` with respect to `a`.
We are given a function \(f(a) = \frac{a}{c}\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).
We can rewrite \(f(a)\) as \(f(a) = c^{-1}a\). Since \(c^{-1}\) is a constant with respect to \(a\), it can be pulled out of the derivative (the constant multiple rule).
Now, we can differentiate this with respect to \(a\):
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (c^{-1}a) \\ &= c^{-1} \frac{d}{da} (a) \\ &= c^{-1} \\ &= \frac{1}{c} \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\frac{1}{c}\).
divide_scalar
divide_scalar (a:minima.autograd.Tensor, scalar:Union[int,float])
Divides a tensor by a scalar.
Args:
    a (Tensor): The tensor to divide.
    scalar (int, float): The scalar to divide the tensor by.
Returns:
    Tensor: The resulting tensor after division.
Example:
>>> import numpy as np
>>> a = Tensor(np.array([1, 2, 3]))
>>> scalar = 2
>>> result = divide_scalar(a, scalar)
>>> print(result)
Tensor([0.5, 1.0, 1.5])
DivScalar
DivScalar (scalar:Union[int,float])
The DivScalar operation divides a tensor by a scalar.
Example:
>>> import numpy as np
>>> a = Tensor(np.array([1, 2, 3]))
>>> scalar = 2
>>> div_scalar = DivScalar(scalar)
>>> result = div_scalar.compute(a.data)
>>> print(result)
array([0.5, 1.0, 1.5])
Negation
Let’s denote `a` as the tensor being negated. The operation can be described as `f(a) = -a`.
The function for the backward pass (i.e., the gradient) is `df/da = -1`.
We are given a function \(f(a) = -a\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (-a) \\ &= -1 \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(-1\).
negate
negate (a:minima.autograd.Tensor)
Negates the given tensor.
Args:
- a: The tensor to negate.
Returns: The negation of a.
Example:
>>> a = Tensor([1, -2, 3])
>>> result = negate(a)
>>> print(result)
Tensor([-1, 2, -3])
Negate
Negate ()
Negates the given tensor.
Example:
>>> a = Tensor([1, -2, 3])
>>> op = Negate()
>>> result = op.compute(a)
>>> print(result)
Tensor([-1, 2, -3])
Exp
Explanation for the derivative of the `Exp` operator:
Let’s denote `a` as the tensor on which the exponential function is applied. The operation can be described as `f(a) = exp(a)`, where `exp` represents the exponential function.
The function for the backward pass (i.e., the gradient) is `df/da = exp(a)`.
We are given a function \(f(a) = \exp(a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (\exp(a)) \\ &= \exp(a) \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\exp(a)\).
exp
exp (a:minima.autograd.Tensor)
Calculates the exponential of the given tensor.
Args:
- a: The tensor.
Returns: The exponential of a.
Example:
>>> a = Tensor([1, 2, 3])
>>> result = exp(a)
>>> print(result)
Tensor([2.71828183, 7.3890561, 20.08553692])
Exp
Exp ()
Calculates the exponential of the given tensor.
Example:
>>> a = Tensor([1, 2, 3])
>>> op = Exp()
>>> result = op.compute(a)
>>> print(result)
Tensor([2.71828183, 7.3890561, 20.08553692])
ReLU
The derivative of the `ReLU` (Rectified Linear Unit) operator:
Let’s denote `a` as the tensor on which the ReLU function is applied. The ReLU function is defined as follows:
\[ f(a) = \begin{cases} a, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]
The function for the backward pass (i.e., the gradient) is `df/da = 1` if `a >= 0`, and `df/da = 0` if `a < 0`. Strictly speaking, ReLU is not differentiable at `a = 0`; assigning the value 1 there is a convention.
We are given a function \(f(a) = \max(0, a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).
By considering the definition of the ReLU function, we can write \(f(a)\) as:
\[ f(a) = \begin{cases} a, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]
Now, let’s differentiate \(f(a)\) with respect to \(a\):
\[ \frac{df}{da} = \begin{cases} 1, & \text{if } a \geq 0 \\ 0, & \text{if } a < 0 \end{cases} \]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\) if \(a \geq 0\), and \(0\) if \(a < 0\).
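In practice this piecewise derivative acts as a mask on `out_grad`. The sketch below shows the idea in NumPy, following the convention above that the gradient is 1 at `a = 0`. `relu_backward` is a hypothetical helper for illustration and is not taken from the framework's `ReLU` class.

```python
import numpy as np


def relu_backward(out_grad, a):
    # Local derivative: 1 where a >= 0, 0 where a < 0
    mask = (a >= 0).astype(out_grad.dtype)
    return out_grad * mask


a = np.array([1.0, -2.0, 0.0, 3.0])
out_grad = np.array([10.0, 20.0, 30.0, 40.0])

grad_a = relu_backward(out_grad, a)
print(grad_a)  # [10.  0. 30. 40.]
```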
relu
relu (a:minima.autograd.Tensor)
Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.
Args:
- a: The tensor.
Returns: The result of applying ReLU to a.
Example:
>>> a = Tensor([1, -2, 3])
>>> result = relu(a)
>>> print(result)
Tensor([1, 0, 3])
ReLU
ReLU ()
Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.
Example:
>>> a = Tensor([1, -2, 3])
>>> op = ReLU()
>>> result = op.compute(a)
>>> print(result)
Tensor([1, 0, 3])
Power Scalar
The derivative of the `PowerScalar` operator:
Let’s denote the scalar as `n` and `a` as the tensor being raised to the power of the scalar. The operation can be described as `f(a) = a^n`.
The function for the backward pass (i.e., the gradient) is `df/da = n * a^(n-1)`.
We are given a function \(f(a) = a^n\), where \(a\) is a tensor and \(n\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (a^n) \\ &= n \cdot a^{n-1} \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(n \cdot a^{n-1}\).
power_scalar
power_scalar (a:minima.autograd.Tensor, scalar:int)
Raises a tensor to a power.
Args:
    a (Tensor): The input tensor.
    scalar (int): The power to raise the tensor to.
Returns:
    Tensor: The resulting tensor after the power operation.
Example:
>>> import numpy as np
>>> tensor = Tensor(np.array([1, 2, 3]))
>>> result = power_scalar(tensor, 2)
>>> print(result)
Tensor([1, 4, 9])
PowerScalar
PowerScalar (scalar:int)
The PowerScalar operation raises a tensor to an (integer) power.
Attributes:
    scalar (int): The power to raise the tensor to.
Example:
>>> import numpy as np
>>> tensor = Tensor(np.array([1, 2, 3]))
>>> pow_scalar = PowerScalar(2)
>>> result = pow_scalar.compute(tensor.data)
>>> print(result)
array([1, 4, 9])
Log
Explanation for the derivative of the `Log` operator:
Let’s denote `a` as the tensor on which the logarithm is applied. The operation can be described as `f(a) = log(a)`, where `log` represents the natural logarithm.
The function for the backward pass (i.e., the gradient) is `df/da = 1/a`.
We are given a function \(f(a) = \log(a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:
\[\begin{align*} \frac{df}{da} &= \frac{d}{da} (\log(a)) \\ &= \frac{1}{a} \end{align*}\]
Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\frac{1}{a}\), which is the behavior implemented by the `Log` operator.
class Log(TensorOp):
    """
    The Log operation applies the natural logarithm element-wise on the tensor.
    Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1.0, 2.0, 3.0]))
    >>> log_op = Log()
    >>> result = log_op.compute(a.data)
    >>> print(result)
    array([0., 0.69314718, 1.09861229])
    """
    def compute(self, a: NDArray) -> NDArray:
        """
        Applies the natural logarithm to the tensor.
        Args:
            a (NDArray): The input tensor.
        Returns:
            NDArray: The resulting tensor after applying the natural logarithm.
        """
        return ARRAY_API.log(a)
    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, ...]:
        """
        Computes the gradient of the log operation.
        Args:
            out_grad (Tensor): The gradient of the output tensor.
            node (Tensor): The node in the computational graph where the operation was performed.
        Returns:
            Tuple[Tensor, ...]: The gradient with respect to the input tensor.
        """
        a = node.children[0]
        # Chain rule: d(log a)/da = 1/a, so the input gradient is out_grad / a
        return (out_grad / a,)
def log(a: Tensor) -> Tensor:
    """
    Applies the natural logarithm to the tensor.
    Args:
        a (Tensor): The input tensor.
    Returns:
        Tensor: The resulting tensor after applying the natural logarithm.
    Example:
    >>> import numpy as np
    >>> a = Tensor(np.array([1.0, 2.0, 3.0]))
    >>> result = log(a)
    >>> print(result)
    Tensor([0., 0.69314718, 1.09861229])
    """
    return Log()(a)
Transpose
The operation described here is transposition. Let’s define it as a function `f` such that `f(a) = a^T`, where `a` is a tensor and `a^T` is the transpose of `a`.
The goal here is to compute the gradient of the loss with respect to `a`. It’s important to note that transposition doesn’t change the values of the tensor’s elements; it only rearranges their positions. This implies that the gradient with respect to the input is simply the gradient with respect to the output with its elements moved back to their original positions, that is, the transposed output gradient.
Let’s denote the gradient of the loss \(L\) with respect to the transposed tensor as `g`, which can be written as `g = dL/df` (this is what `out_grad` holds).
Given this understanding, we can make an important conclusion:
- The gradient of the loss with respect to `a` is `dL/da = g^T`, meaning that the gradient flowing back through a transposition is simply the transposed output gradient.
We have a function \(f(a) = a^T\), where \(a\) is a tensor and \(a^T\) is its transpose. We want to find the gradient of the loss with respect to \(a\), that is, compute \(\frac{\partial L}{\partial a}\).
\[\frac{\partial L}{\partial a} = \left(\frac{\partial L}{\partial f}\right)^T = g^T\]
In the equation above, \(g\) is the gradient of the loss with respect to the transposed tensor. This equation indicates that the gradient with respect to the input of a transposition is the transpose of the gradient with respect to its output.
Now, if we consider a Python class `Transpose` that implements this transposition operation, we would have a `gradient` method in the class that computes the derivative of the transpose operation. This method would apply the transpose function to `out_grad`, which represents the gradient of the output tensor, thereby giving us the gradient with respect to the original tensor. In the code, `transpose(out_grad, axes=self.axes)` performs the transposition of `out_grad` along the same axes that were used in the forward pass. Thus, the gradient of the transposition operation with respect to the input tensor `a` is computed as the transpose of the output gradient `out_grad`.
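A small NumPy sketch of this backward rule follows. It assumes, as in the `transpose` documentation for this module, that `axes` names a pair of axes to swap and that the last two axes are swapped by default. Swapping a pair of axes is its own inverse, which is why reusing the forward axes works. `transpose_backward` is a hypothetical helper, not the framework's method.

```python
import numpy as np


def transpose_backward(out_grad, axes=None):
    # Default (assumed): swap the last two axes, as in the forward pass
    if axes is None:
        axes = (out_grad.ndim - 2, out_grad.ndim - 1)
    # Swapping the same pair of axes again undoes the forward swap
    return np.swapaxes(out_grad, axes[0], axes[1])


a = np.arange(6.0).reshape(2, 3)         # input of shape (2, 3)
out_grad = np.arange(6.0).reshape(3, 2)  # same shape as a.T

grad_a = transpose_backward(out_grad)
print(grad_a.shape)  # (2, 3)
```

Element `a[i, j]` ends up at position `[j, i]` of the output, so its gradient is `out_grad[j, i]`, which is exactly what the swap returns.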
transpose
transpose (a:minima.autograd.Tensor, axes:Optional[tuple]=None)
Perform the transpose operation on the input tensor along the specified axes. If no axes are specified, it swaps the last two dimensions of the input tensor.
Args:
    a (Tensor): The input tensor.
    axes (Optional[tuple]): The pair of axes that should be swapped. If not provided, the last two axes are swapped.
Returns:
    Tensor: The transposed tensor.
Example:
>>> import numpy as np
>>> a = Tensor(np.arange(1, 7).reshape(2, 3))
>>> result = transpose(a)
>>> print(result)
Tensor([[1, 4], [2, 5], [3, 6]])
Transpose
Transpose (axes:Optional[tuple]=None)
Tensor operation class that performs transposition of a tensor along specified axes.
If no axes are specified, it swaps the last two dimensions of the input tensor.
Example:
>>> import numpy as np
>>> a = Tensor(np.arange(1, 7).reshape(2, 3))
>>> op = Transpose()
>>> result = op.compute(a.data)
>>> print(result)
array([[1, 4], [2, 5], [3, 6]])
Reshape
The operation described here is a reshaping of a tensor `a`, where the operation can be described as `f(a) = reshape(a, new_shape)`.
We’ll compute the gradient that flows back through this operation.
The reshaping operation doesn’t change the values of the tensor elements but only rearranges them. This means that the gradient with respect to the input is just the gradient with respect to the output, reshaped back to the original shape.
Let’s denote the gradient of the loss \(L\) with respect to the reshaped tensor as `g = dL/df`, where `f(a) = reshape(a, new_shape)`.
Given this, we can derive the following:
- The gradient of the loss with respect to `a` is `dL/da = reshape(g, original_shape)`.
We are given a function \(f(a) = \text{reshape}(a, \text{new\_shape})\), where \(a\) is a tensor. Our task is to find the gradient of the loss with respect to \(a\):
\[\frac{\partial L}{\partial a} = \text{reshape}\left(\frac{\partial L}{\partial f}, \text{original\_shape}\right) = \text{reshape}(g, \text{original\_shape})\]
Here, \(g\) is the gradient of the loss with respect to the reshaped tensor. The resulting gradient has the same shape as the original tensor.
Now, let’s apply this to the `Reshape` class.
The `gradient` method in the `Reshape` class computes the gradient of the reshape operation. This is implemented by applying the `reshape` function to `out_grad`, which is the gradient of the output tensor, and then returning this reshaped gradient. The shape used for the reshaping is the shape of the original tensor, which is obtained from `node.children[0].shape`.
Therefore, the gradient of the reshape operation with respect to the input tensor `a` is the reshaping of the output gradient `out_grad` to the shape of the original tensor.
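The same idea in a few lines of NumPy, as an illustrative sketch only. `reshape_backward` is a hypothetical helper, and `original_shape` plays the role of `node.children[0].shape` in the description above.

```python
import numpy as np


def reshape_backward(out_grad, original_shape):
    # Values are untouched by reshape, so the gradient only needs
    # to be laid out in the input's original shape again.
    return out_grad.reshape(original_shape)


a = np.arange(6.0)                        # original shape (6,)
out = a.reshape(2, 3)                     # forward pass
out_grad = np.arange(6.0).reshape(2, 3)   # gradient w.r.t. the output

grad_a = reshape_backward(out_grad, a.shape)
print(grad_a.shape)  # (6,)
```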
reshape
reshape (a:minima.autograd.Tensor, shape:Tuple[int,...])
Reshape the input tensor to the specified shape.
Args:
    a (Tensor): The input tensor.
    shape (Tuple[int, ...]): The desired shape of the output tensor.
Returns:
    Tensor: The reshaped tensor.
Example:
>>> a = Tensor([1, 2, 3, 4, 5, 6])
>>> result = reshape(a, (2, 3))
>>> print(result)
Tensor([[1, 2, 3], [4, 5, 6]])
Reshape
Reshape (shape:Tuple[int,...])
Tensor operation class that reshapes a tensor.
Example:
>>> a = Tensor([1, 2, 3, 4, 5, 6])
>>> op = Reshape((2, 3))
>>> result = op.compute(a)
>>> print(result)
Tensor([[1, 2, 3], [4, 5, 6]])
Matrix Multiplication
Matrix multiplication, often abbreviated as “matmul” in numerical libraries, is the operation of multiplying two matrices together. For backpropagation, what we need is the derivative of this operation with respect to each of its inputs.
When dealing with matrices, instead of talking about a single derivative, we often discuss the Jacobian, which is a matrix of partial derivatives. For a function that takes a matrix as input and produces a scalar output, the gradient is a matrix of the same shape as the input matrix.
In deep learning, these gradients are needed when training a neural network, because the weights are updated using the gradient of the loss with respect to them.
Let’s denote the matrices as `A` and `B`, where `A` is a matrix of dimension `m x n` and `B` is a matrix of dimension `n x p`, and the result of the multiplication `C = A * B` is a matrix of dimension `m x p`.
If we are to compute the derivative of `C` with respect to `A` (i.e., ∂C/∂A), each element in `A` affects all elements in its corresponding row in `C`.
Similarly, if we are to compute the derivative of `C` with respect to `B` (i.e., ∂C/∂B), each element in `B` affects all elements in its corresponding column in `C`.
In actual computation, if we have a scalar-valued loss function `L`, we would compute the gradient of `L` with respect to `A` (denoted as ∂L/∂A), which is the same shape as `A`. To compute this, we need to know the gradient of `L` with respect to `C` (denoted as ∂L/∂C), then:
∂L/∂A = (∂L/∂C) * B^T (where * denotes matrix multiplication and B^T is the transpose of B)
Similarly, to compute the gradient of `L` with respect to `B` (denoted as ∂L/∂B):
∂L/∂B = A^T * (∂L/∂C)
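These two formulas can be verified numerically. The sketch below uses NumPy and a loss of the form `L = sum(out_grad * C)`, chosen so that ∂L/∂C equals `out_grad` exactly. The helper `matmul_backward` and the loss are assumptions made for illustration, not the framework's `MatMul.gradient`.

```python
import numpy as np


def matmul_backward(out_grad, A, B):
    grad_A = out_grad @ B.T   # dL/dA = (dL/dC) B^T, shape (m, n)
    grad_B = A.T @ out_grad   # dL/dB = A^T (dL/dC), shape (n, p)
    return grad_A, grad_B


rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
out_grad = rng.standard_normal((2, 4))  # plays the role of dL/dC

grad_A, grad_B = matmul_backward(out_grad, A, B)


def loss(A_, B_):
    # Scalar loss whose gradient w.r.t. C = A_ @ B_ is exactly out_grad
    return np.sum(out_grad * (A_ @ B_))


# Central finite differences for every entry of A
eps = 1e-6
num_grad_A = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        num_grad_A[i, j] = (loss(A + dA, B) - loss(A - dA, B)) / (2 * eps)

print(np.allclose(grad_A, num_grad_A, atol=1e-6))
```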
The line `axes_to_sum_over = tuple(range(len(out_shape) - len(lhs_shape)))` calculates which axes (dimensions) of the output gradient tensor (`out_grad`) need to be summed over when computing the gradient with respect to the left-hand side (`a`) input tensor.
This is necessary when the rank (number of dimensions) of `out_grad` is larger than the rank of `a`. This can happen, for instance, when `a` is a matrix (2D tensor) and `out_grad` is a 3D tensor (which can result from batched matrix multiplication).
The `range` function generates a sequence of integers from 0 up to (but not including) `len(out_shape) - len(lhs_shape)`. The `tuple` function then takes this sequence and turns it into a tuple. The result is a tuple of integers representing the axes to sum over.
Here is a concrete example:
Suppose we have a batched matrix multiplication where `A` is a matrix of shape `(m, n)`, and `out_grad` is a 3D tensor of shape `(b, m, n)`, where `b` is the batch size.
In this case, `len(out_shape) - len(lhs_shape)` equals `1`, so `range(len(out_shape) - len(lhs_shape))` generates a sequence of integers from `0` to `1` (not inclusive), which is just `[0]`.
So `axes_to_sum_over` will be `(0,)`, indicating that we need to sum over the first axis (the batch axis) of `out_grad` when computing the gradient with respect to `A`.
This summing operation effectively accumulates the individual gradients for each item in the batch into a single gradient for the `A` matrix.
import torch
# Suppose we have the following shapes for `lhs` and `out_grad`
m, n, b = 5, 7, 3
# Let's create some tensors with these shapes
A = torch.randn(m, n)  # lhs is a 2D tensor (matrix) of shape (m, n)
out_grad = torch.randn(b, m, n)  # out_grad is a 3D tensor of shape (b, m, n)
# Let's say `rhs` is another matrix that was involved in computing out_grad
B = torch.randn(n, m)
out_shape, A_shape, B_shape = out_grad.shape, A.shape, B.shape
out_shape, A_shape, B_shape
(torch.Size([3, 5, 7]), torch.Size([5, 7]), torch.Size([7, 5]))
len(out_shape), len(A_shape)
(3, 2)
rng = range(len(out_shape) - len(A_shape))
rng
range(0, 1)
tuple(rng)
(0,)
axes_to_sum_over = tuple(range(len(out_shape) - len(A_shape)))
axes_to_sum_over
(0,)
torch.sum(out_grad @ B, axes_to_sum_over)
tensor([[-0.1309, -1.9203, -4.4179, 2.8422, -0.4453],
[-1.5883, -8.1020, -6.7316, -1.3045, 0.6170],
[-0.5317, 2.3444, 1.6038, -3.5786, -0.1689],
[ 1.0831, -1.3743, 0.8485, -3.0593, 2.2023],
[ 0.3071, 1.8321, -3.6827, -9.4409, -1.1884]])
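The session above uses its shapes only to illustrate how the axes are computed. For a fuller picture, here is a NumPy sketch of a batched case in which the summed result lands on the shape of `A`. It assumes `A` has shape `(m, n)` and is shared across the batch, `B` has shape `(b, n, p)`, and `out_grad` has the shape of `A @ B`, namely `(b, m, p)`. These shapes and variable names are assumptions for illustration.

```python
import numpy as np

b, m, n, p = 3, 5, 7, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))            # lhs, shared across the batch
B = rng.standard_normal((b, n, p))         # rhs, one matrix per batch item
out_grad = rng.standard_normal((b, m, p))  # same shape as A @ B

# Per-batch-item gradient w.r.t. A: out_grad @ B^T, shape (b, m, n)
unreduced = out_grad @ np.swapaxes(B, -1, -2)

# Leading axes that A does not have must be summed away
axes_to_sum_over = tuple(range(unreduced.ndim - A.ndim))
grad_A = unreduced.sum(axis=axes_to_sum_over)

print(axes_to_sum_over)  # (0,)
print(grad_A.shape)      # (5, 7)
```

Summing over the batch axis reflects that the same `A` contributed to every batch item, so its gradient is the sum of the per-item gradients.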
matmul
matmul (a:minima.autograd.Tensor, b:minima.autograd.Tensor)
Perform matrix multiplication on two tensors.
Args:
    a (Tensor): The first input tensor.
    b (Tensor): The second input tensor.
Returns:
    Tensor: The product of a and b.
Example:
>>> a = Tensor([[1, 2], [3, 4]])
>>> b = Tensor([[5, 6], [7, 8]])
>>> result = matmul(a, b)
>>> print(result)
Tensor([[19, 22], [43, 50]])
MatMul
MatMul ()
Tensor operation class that performs matrix multiplication.
Example:
>>> a = Tensor([[1, 2], [3, 4]])
>>> b = Tensor([[5, 6], [7, 8]])
>>> op = MatMul()
>>> result = op.compute(a, b)
>>> print(result)
Tensor([[19, 22], [43, 50]])
Summation
The `Summation` operation, when provided with the `axes` argument, sums over these axes and thereby reduces the rank of the tensor by the number of axes summed over. The backward pass needs to take this into account, as it needs to return a gradient tensor of the same shape as the input.
The forward pass (`compute` method) is straightforward: it just computes the sum over the specified axes.
In the backward pass (`gradient` method), the goal is to compute the gradient of the sum operation. Since every element of the input tensor contributes equally to the sum, the derivative of the sum with respect to each element is 1. However, since the sum operation may reduce the dimensionality of the tensor (when `axes` is not `None`), we need to account for this when computing the gradient.
To do this, we first create a new shape, where the dimensions specified by `axes` are replaced by 1. We then reshape `out_grad` to this new shape. This essentially “undoes” the dimensionality reduction performed by the sum operation. Finally, we use `broadcast_to` to make the reshaped gradient tensor the same shape as the input tensor.
Suppose you have the following tensor in PyTorch:
# 3x3 tensor
x = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]], requires_grad=True)
# Sum over axis 0
y = x.sum(axis=0)
y
tensor([12., 15., 18.], grad_fn=<SumBackward1>)
`y` is now a 1-dimensional tensor of shape `(3,)`, because we’ve summed over axis 0. If we compute the gradient of `y` with respect to `x`, we’ll want the resulting gradient tensor to have the same shape as `x`, which is `(3,3)`. However, the gradient tensor we receive during backpropagation (`out_grad`) will have the same shape as `y`, which is `(3,)`.
So we need to “undo” the dimensionality reduction by reshaping and broadcasting `out_grad` to match the shape of `x`. Here’s how you can do it in PyTorch:
# Mock out_grad tensor
out_grad = torch.tensor([1., 1., 1.])
# Reshape out_grad so that the summed axis (axis 0) reappears with size 1
reshaped_grad = out_grad.reshape(1, 3)
# Broadcast the reshaped_grad to match the input shape
broadcasted_grad = reshaped_grad.expand_as(x)
print(broadcasted_grad)
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
Now broadcasted_grad has the same shape as x, so it can be used as the gradient of x in further computations. This manual operation mirrors what the gradient method of the Summation operation does.
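The reshape-then-broadcast step described above generalizes to any choice of axes. The following is a minimal NumPy sketch of that idea, not minima's actual implementation; the helper name summation_backward and its signature are assumptions made for illustration.

```python
import numpy as np

def summation_backward(out_grad, input_shape, axes=None):
    """Hypothetical helper: expand the gradient of a sum back to input_shape."""
    if axes is None:
        # Every axis was summed away, so every axis becomes size 1.
        axes = tuple(range(len(input_shape)))
    elif isinstance(axes, int):
        axes = (axes,)
    # Normalize negative axes such as -1.
    axes = tuple(ax % len(input_shape) for ax in axes)
    # Replace each summed axis with 1 so the reduced dimensions reappear.
    kept_shape = [1 if i in axes else dim for i, dim in enumerate(input_shape)]
    reshaped = np.reshape(out_grad, kept_shape)
    # Copy the gradient along the summed axes.
    return np.broadcast_to(reshaped, input_shape)

x = np.arange(6.0).reshape(2, 3)
out_grad = np.array([10.0, 20.0])  # gradient with respect to x.sum(axis=1)
grad_x = summation_backward(out_grad, x.shape, axes=(1,))
print(grad_x)
```

Each row of grad_x repeats the gradient of the corresponding row sum, because every element of a row contributes to that sum with a local derivative of 1.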
summation
summation (a:minima.autograd.Tensor, axes:Optional[tuple]=None)
Computes the sum of a along the specified axes.
Args:
- a: The input tensor.
- axes (tuple, optional): The dimensions to reduce. If None (default), reduces all dimensions.
Returns: The sum of a along the specified axes.
Summation
Summation (axes:Optional[tuple]=None)
Op to compute the sum of a tensor along specified axes.
Example:
>>> a = Tensor([[1, 2, 3], [4, 5, 6]])
>>> op = Summation(axes=(0,))
>>> result = op.compute(a)
>>> print(result)
Tensor([5, 7, 9])
Args:
- axes (tuple, optional): The dimensions to reduce. If None (default), reduces all dimensions.
Methods:
- compute(a: NDArray) -> NDArray: Computes the sum of a along the specified axes.
- gradient(out_grad: Tensor, node: Tensor) -> Tuple[Tensor]: Computes the gradient of the sum operation.
Broadcast
# First, we create a tensor a, and set requires_grad = True so that we can compute gradients with respect to it
a = torch.tensor([1., 2., 3.], requires_grad=True)
# Now, let's define a function that performs the broadcasting operation
def broadcast_to(input, shape):
    return input.expand(shape)
# We broadcast a to a larger shape
shape = (3, 3)
b = broadcast_to(a, shape)
b
tensor([[1., 2., 3.],
        [1., 2., 3.],
        [1., 2., 3.]], grad_fn=<ExpandBackward0>)
b.shape
torch.Size([3, 3])
# Then, we define an output tensor as the sum of elements in b
# This is a simple function that we can differentiate, and will result in a gradient for b
out = b.sum()
# Compute gradients
out.backward()
a.grad
tensor([3., 3., 3.])
# Define the output gradient tensor
out_grad = torch.tensor([[1., 2., 3.], [1., 2., 3.], [1., 2., 3.]])
out_grad.shape
torch.Size([3, 3])
a_shape = a.shape
a_shape
torch.Size([3])
# Left-pad the input shape with 1s so it has the same rank as the broadcast shape
shape = [1] * (len(out_grad.shape) - len(a_shape)) + list(a_shape)
shape
[1, 3]
# The gradient for the broadcast operation is the sum of out_grad over the dimension that was broadcast
grad_a = out_grad.sum(dim=0)
print(grad_a)
tensor([3., 6., 9.])
broadcast_to
broadcast_to (a:minima.autograd.Tensor, shape:Tuple[int,...])
Broadcasts a to the specified shape.
Args:
- a: The input tensor.
- shape: The new shape to broadcast the input tensor to.
Returns: The tensor a broadcast to the specified shape.
BroadcastTo
BroadcastTo (shape)
Op to broadcast a tensor to a new shape.
Example:
>>> a = Tensor([1, 2, 3])
>>> op = BroadcastTo((3, 3))
>>> result = op.compute(a)
>>> print(result)
Tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
Args:
- shape (tuple): The new shape to broadcast the input tensor to.
Methods:
- compute(a: NDArray) -> NDArray: Broadcasts a to the specified shape.
- gradient(out_grad: Tensor, node: Tensor) -> Tuple[Tensor]: Computes the gradient of the broadcast operation.
br = BroadcastTo((5,2,3))
a = Tensor([[1., 2., 3.], [1., 2., 3.]])
a.shape
(2, 3)
a_br = br.compute(a)
a_br.shape
(5, 2, 3)
out_grad = Tensor(numpy.ones_like(a_br))
out_grad.shape
(5, 2, 3)
out_grad
minima.Tensor(
[[[1 1 1]
[1 1 1]]
[[1 1 1]
[1 1 1]]
[[1 1 1]
[1 1 1]]
[[1 1 1]
[1 1 1]]
[[1 1 1]
[1 1 1]]])
a_shape = a.shape
a_shape
(2, 3)
shape = [1] * (len(br.shape) - len(a_shape)) + list(a_shape)
br.shape, shape
((5, 2, 3), [1, 2, 3])
sum_over = tuple([idx for idx in range(len(br.shape)) if br.shape[idx] != shape[idx]])
sum_over
(0,)
reshape(summation(out_grad, sum_over), a_shape).shape
(2, 3)
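The cells above can be condensed into a single function. The following NumPy sketch follows the same three steps (pad the input shape, find the broadcast axes, sum and reshape); the helper name broadcast_backward is hypothetical and this is not presented as the body of BroadcastTo.gradient.

```python
import numpy as np

def broadcast_backward(out_grad, input_shape):
    """Hypothetical helper: reduce out_grad back to the pre-broadcast input_shape."""
    out_shape = out_grad.shape
    # Left-pad the input shape with 1s so both shapes have the same rank.
    padded = [1] * (len(out_shape) - len(input_shape)) + list(input_shape)
    # An axis was broadcast wherever the padded input size differs from the output size.
    sum_over = tuple(i for i, (o, p) in enumerate(zip(out_shape, padded)) if o != p)
    # Sum the gradient over the broadcast axes, then restore the input shape.
    return out_grad.sum(axis=sum_over).reshape(input_shape)

out_grad = np.ones((5, 2, 3))
grad_a = broadcast_backward(out_grad, (2, 3))
print(grad_a.shape)  # (2, 3)
print(grad_a)        # every entry is 5.0, one contribution per broadcast copy
```

Summing is the right reduction because each input element is copied into several output positions, so its gradient is the sum of the gradients arriving at all of those copies.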
LogSumExp
logsumexp
logsumexp (a, axes=None)
LogSumExp
LogSumExp (axes:Optional[tuple]=None)
A Tensor operation class for performing LogSumExp computation.
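LogSumExp computes log(sum(exp(a))) over the given axes. A direct evaluation overflows for large inputs, so the usual approach subtracts the maximum first and adds it back afterwards. The sketch below shows that standard trick in NumPy; the function name logsumexp_sketch is hypothetical, and the source does not confirm that minima's LogSumExp is implemented this way.

```python
import numpy as np

def logsumexp_sketch(a, axes=None):
    """Hypothetical NumPy version of log(sum(exp(a))) over the given axes."""
    a = np.asarray(a, dtype=float)
    # Keep the reduced dimensions so the maximum broadcasts against a.
    max_kept = np.max(a, axis=axes, keepdims=True)
    # After subtracting the maximum, the largest exponent is exp(0) = 1.
    summed = np.sum(np.exp(a - max_kept), axis=axes)
    # Add the maximum back, dropping the dimensions that were kept.
    return np.log(summed) + np.squeeze(max_kept, axis=axes)

a = np.array([[1000.0, 1000.0], [0.0, 0.0]])
result = logsumexp_sketch(a, axes=(1,))
print(result)
```

For the first row, exp(1000) alone would overflow to infinity in float64, while the shifted form evaluates to 1000 + log(2).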
Export
import nbdev; nbdev.nbdev_export()