# Design Notes: Why can you change the primal computation?

These design notes are to help you understand ChainRules.jl's `rrule`

function. It explains why we have a `rrule`

function that returns both the primal result (i.e. the output for the forward pass) and the pullback as a closure. It might be surprising to some AD authors, who might expect just a function that performs the pullback, that the `rrule`

function computes the primal result as well as the pullback. In particularly, `rrule`

allows you to *change* how the primal result is computed. We will illustrate in this document why being able to change the computation of the primal is crucial for efficient AD.

Discussion here is focused on on reverse mode and `rrule`

. Similar concerns do apply to forward mode and `frule`

. In forward mode these concerns lead to the fusing of the `pushforward`

into `frule`

. All the examples given here also apply in forward mode. In fact in forward mode there are even more opportunities to take advantage of sharing work between the primal and derivative computations. A particularly notable example is in efficiently calculating the pushforward of solving a differential equation via expanding the system of equations to also include the derivatives before solving it.

## The Journey to `rrule`

Let's imagine a different system for rules, one that doesn't let you define the computation of the primal. This system is what a lot of AD systems have. It is what Nabla.jl had originally.^{[1]} We will have a primal (i.e. forward) pass that directly executes the primal function and just records the primal *function*, its *inputs* and its *output* onto the tape.^{[2]}. Then during the gradient (i.e. reverse) pass it has a function which receives those records from the tape along with the sensitivity of the output, and gives back the sensitivity of the input. We will call this function `pullback_at`

, as it pulls back the sensitivity at a given primal point. To make this concrete:

```
y = f(x) # primal program
x̄ = pullback_at(f, x, y, ȳ)
```

Let's illustrate this with examples for `sin`

and for the logistic sigmoid.

## Example for `sin`

```
y = sin(x)
pullback_at(::typeof(sin), x, y, ȳ) = ȳ * cos(x)
```

`pullback_at`

uses the primal input `x`

, and the sensitivity being pulled back `ȳ`

.

## Example for the logistic sigmoid

```
σ(x) = 1/(1 + exp(-x)) # = exp(x) / (1 + exp(x))
y = σ(x)
pullback_at(::typeof(σ), x, y, ȳ) = ȳ * y * σ(-x) # = ȳ * σ(x) * σ(-x)
```

Notice that in `pullback_at`

we are not only using input `x`

but also using the primal output `y`

. This is a nice bit of symmetry that shows up around `exp`

.

Now let's consider why we implement `rrule`

s in the first place. One key reason is to insert domain knowledge so as to compute the derivative more efficiently than AD would just by breaking everything down into `+`

, `*`

, etc.^{[3]} What insights do we have about `sin`

and `cos`

? What about using `sincos`

?

## Example for `sin`

```
julia> using BenchmarkTools
julia> @btime sin(x) setup=(x=rand());
3.838 ns (0 allocations: 0 bytes)
julia> @btime cos(x) setup=(x=rand());
4.795 ns (0 allocations: 0 bytes)
julia> 3.838 + 4.795
8.633
```

vs computing both together:

```
julia> @btime sincos(x) setup=(x=rand());
6.028 ns (0 allocations: 0 bytes)
```

What about the logistic sigmoid? We note that the two values we need are `σ(x)`

and `σ(-x)`

If we write these as: $\sigma(x) = \frac{e^x}{1+e^x}$ and $\sigma(-x) = \frac{1}{1+e^x}$ then we see they have the common term $e^x$. `exp(x)`

is a much more expensive operation than `+`

and `/`

. So we can save time, if we can reuse that `exp(x)`

.

## Example for the logistic sigmoid

If we have to computing separately:

```
julia> @btime 1/(1+exp(x)) setup=(x=rand());
5.622 ns (0 allocations: 0 bytes)
julia> @btime 1/(1+exp(-x)) setup=(x=rand());
6.036 ns (0 allocations: 0 bytes)
julia> 5.622 + 6.036
11.658
```

vs reusing `exp(x)`

:

```
julia> @btime exp(x) setup=(x=rand());
5.367 ns (0 allocations: 0 bytes)
julia> @btime ex/(1+ex) setup=(ex=exp(rand()));
1.255 ns (0 allocations: 0 bytes)
julia> @btime 1/(1+ex) setup=(ex=exp(rand()));
1.256 ns (0 allocations: 0 bytes)
julia> 5.367 + 1.255 + 1.256
7.878
```

So we are talking about a 30-40% speed-up from these optimizations.^{[4]}

It is faster to compute `sin`

and `cos`

at the same time via `sincos`

than it is to compute them one after the other. And it is faster to reuse the `exp(x)`

in computing `σ(x)`

and `σ(-x)`

. How can we incorporate this insight into our system? We know we can compute both of these in the primal — because they only depend on `x`

and not on `ȳ`

— but there is nowhere to put them that is accessible both to the primal pass and the gradient pass code.

What if we introduced some variable called `intermediates`

that is also recorded onto the tape during the primal pass? We would need to be able to modify the primal pass to do this, so that we can actually put the data into the `intermediates`

. So we will introduce a function: `augmented_primal`

, that will return the primal output plus the `intermediates`

that we want to reuse in the gradient pass. Then we will make our AD system replace calls to the primal with calls to the `augmented_primal`

of the primal function and take care of all the bookkeeping. So that would look like:

```
y = f(x) # primal program
y, intermediates = augmented_primal(f, x)
x̄ = pullback_at(f, x, y, ȳ, intermediates)
```

## Example for `sin`

```
function augmented_primal(::typeof(sin), x)
y, cx = sincos(x)
return y, (; cx=cx) # use a NamedTuple for the intermediates
end
pullback_at(::typeof(sin), x, y, ȳ, intermediates) = ȳ * intermediates.cx
```

## Example for the logistic sigmoid

```
function augmented_primal(::typeof(σ), x)
ex = exp(x)
y = ex / (1 + ex)
return y, (; ex=ex) # use a NamedTuple for the intermediates
end
pullback_at(::typeof(σ), x, y, ȳ, intermediates) = ȳ * y / (1 + intermediates.ex)
```

Cool! That lets us do what we wanted. We net decreased the time it takes to run the primal and gradient passes. We have now demonstrated the title question of why we want to be able to modify the primal pass. We will go into that more later and have some more usage examples, but first let's continue to see how we go from `augmented_primal`

and `pullback_at`

to `rrule`

.

One thing we notice when looking at `pullback_at`

is it really is starting to have a lot of arguments. It had a fair few already, and now we are adding `intermediates`

as well, making it even more unwieldy. Not to mention these are fairly simple example, the `sin`

and `σ`

functions have 1 input and no keyword arguments. Furthermore, we often don't even use all of the arguments to `pullback_at`

. The new code for pulling back `sin`

— which uses `sincos`

and `intermediates`

— no longer needs `x`

, and it never needed `y`

(though sigmoid `σ`

does). And storing all these things on the tape — inputs, outputs, sensitivities, intermediates — is using up extra memory. What if we generalized the idea of the `intermediate`

named tuple, and had `augmented_primal`

return a struct that just held anything we might want put on the tape.

```
struct PullbackMemory{P, S}
primal_function::P
state::S
end
# convenience constructor:
PullbackMemory(primal_function; state...) = PullbackMemory(primal_function, state)
# convenience accessor so that `m.x` is same as `m.state.x`
Base.getproperty(m::PullbackMemory, propname) = getproperty(getfield(m, :state), propname)
```

So changing our API we have:

```
y = f(x) # primal program
y, pb = augmented_primal(f, x)
x̄ = pullback_at(pb, ȳ)
```

which is much cleaner.

## Example for `sin`

```
function augmented_primal(::typeof(sin), x)
y, cx = sincos(x)
return y, PullbackMemory(sin; cx=cx)
end
pullback_at(pb::PullbackMemory{typeof(sin)}, ȳ) = ȳ * pb.cx
```

## Example for the logistic sigmoid

```
function augmented_primal(::typeof(σ), x)
ex = exp(x)
y = ex / (1 + ex)
return y, PullbackMemory(σ; y=y, ex=ex)
end
pullback_at(pb::PullbackMemory{typeof(σ)}, ȳ) = ȳ * pb.y / (1 + pb.ex)
```

That now looks much simpler; `pullback_at`

only ever has 2 arguments.

One way we could make it nicer to use is by making `PullbackMemory`

a callable object. Conceptually the `PullbackMemory`

is a fixed thing it the contents of the tape for a particular operation. It is fully determined by the end of the primal pass. The during the gradient (reverse) pass the `PullbackMemory`

is used to successively compute the `ȳ`

argument. So it makes sense to make `PullbackMemory`

a callable object that acts on the sensitivity. We can do that via call overloading:

```
y = f(x) # primal program
y, pb = augmented_primal(f, x)
x̄ = pb(ȳ)
```

## Example for `sin`

```
function augmented_primal(::typeof(sin), x)
y, cx = sincos(x)
return y, PullbackMemory(sin; cx=cx)
end
(pb::PullbackMemory{typeof(sin)})(ȳ) = ȳ * pb.cx
```

## Example for the logistic sigmoid

```
function augmented_primal(::typeof(σ), x)
ex = exp(x)
y = ex / (1 + ex)
return y, PullbackMemory(σ; y=y, ex=ex)
end
(pb::PullbackMemory{typeof(σ)})(ȳ) = ȳ * pb.y / (1 + pb.ex)
```

Let's recap what we have done here. We now have an object `pb`

that acts on the cotangent of the output of the primal `ȳ`

to give us the cotangent of the input of the primal function `x̄`

. `pb`

is not just the **memory** of state required for the `pullback`

, it **is** the pullback.

We have one final thing to do, which is to think about how we make the code easy to modify. Let's go back and think about the changes we would have make to go from our original way of writing that only used the inputs/outputs, to one that used the intermediate state.

## Example for `sin`

To rewrite that original formulation in the new pullback form we have:

```
function augmented_primal(::typeof(sin), x)
y = sin(x)
return y, PullbackMemory(sin; x=x)
end
(pb::PullbackMemory)(ȳ) = ȳ * cos(pb.x)
```

To go from that to:

```
function augmented_primal(::typeof(sin), x)
y, cx = sincos(x)
return y, PullbackMemory(sin; cx=cx)
end
(pb::PullbackMemory)(ȳ) = ȳ * pb.cx
```

## Example for the logistic sigmoid

```
function augmented_primal(::typeof(σ), x)
y = σ(x)
return y, PullbackMemory(σ; y=y, x=x)
end
(pb::PullbackMemory{typeof(σ)})(ȳ) = ȳ * pb.y * σ(-pb.x)
```

to get to:

```
function augmented_primal(::typeof(σ), x)
ex = exp(x)
y = ex/(1 + ex)
return y, PullbackMemory(σ; y=y, ex=ex)
end
(pb::PullbackMemory{typeof(σ)})(ȳ) = ȳ * pb.y/(1 + pb.ex)
```

We should think about how we might want to make future changes to this code.^{[6]}

We need to make a series of changes:

- update what work is done in the primal, to compute the intermediate values.
- update what is stored in the
`PullbackMemory`

. - update the function that applies the pullback so it uses the new thing that was stored.

It's important these parts all stay in sync. It's not too bad for this simple example with just one or two things to remember. For more complicated multi-argument functions, which we will show below, you often end up needing to remember half a dozen things, like sizes and indices relating to each input/output, so it gets a little more fiddly to make sure you remember all the things you need to and give them the same name in both places. *Is there a way we can automatically just have all the things we use remembered for us?* Surprisingly for such a specific request, there actually is: a closure.

A closure in Julia is a callable structure that automatically contains a field for every object from its parent scope that is used in its body. There are incredible ways to abuse this; but here we can use closures exactly as they are intended. Replacing `PullbackMemory`

with a closure that works the same way lets us avoid having to manually control what is remembered *and* lets us avoid separately writing the call overload.

## Example for `sin`

```
function augmented_primal(::typeof(sin), x)
y, cx = sincos(x)
pb = ȳ -> cx * ȳ # pullback closure. closes over `cx`
return y, pb
end
```

## Example for the logistic sigmoid

```
function augmented_primal(::typeof(σ), x)
ex = exp(x)
y = ex / (1 + ex)
pb = ȳ -> ȳ * y / (1 + ex) # pullback closure. closes over `y` and `ex`
return y, pb
end
```

This is pretty clean now.

Our `augmented_primal`

is now within spitting distance of `rrule`

. All that is left is a rename and some extra conventions around multiple outputs and gradients with respect to callable objects.

This has been a journey into how we get to `rrule`

as it is defined in `ChainRulesCore`

. We started with an unaugmented primal function and a `pullback_at`

function that only saw the inputs and outputs of the primal. We realized a key limitation of this was that we couldn't share computational work between the primal and gradient passes. To solve this we introduced the notation of some `intermediate`

that is shared from the primal to the pullback. We successively improved that idea, first by making it a type that held everything that is needed for the pullback: the `PullbackMemory`

, which we then made callable, so it was itself the pullback. Finally, we replaced that separate callable structure with a closure, which kept everything in one place and made it more convenient.

## More Shared Work Examples

`sin`

and the logistic sigmoid are nice, simple examples of when it is useful to share work between the primal and the pullback. There are many others though. It is actually surprising that in so many cases it is reasonable to write the rules where the only shared information between the primal and the pullback is the primal inputs (like our original `sin`

), or primal outputs (like our original logistic sigmoid). Under our formulation above, those primal inputs/outputs are shared information just like any other. Beyond this, there are a number of other decent applications.

`getindex`

In Julia (and many other numerical languages) indexing can take many more arguments than simply a couple of integers, such as boolean masking arrays (logical indexing), ranges for slices, etc. Converting the arguments to plain integers, arrays of integers, and ranges with `Base.to_indices`

is the first thing that `getindex`

does. It then re-calls `getindex`

with these simpler types to get the result.

The result of pulling back the `getindex`

operation is always an array that is all zeros, except for the elements that are selected, which are set to the appropriate sensitivities being pulled back. To identify which actual positions in the array are being gotten/set is common work to both primal and gradient computations. We really don't want to deal with fancy indexing types during the pullback, because there are weird edge cases like indexing in such a way that the same element is output twice (and thus we have 2 sensitivities we need to add to it). We can pull the `to_indices`

out of the primal computation and remember the plain indexes used, then can reuse them to set gradients during the pullback.

See the code for this in ChainRules.jl

`exp(::Matrix)`

Matrix Functions are generalizations of scalar functions to operate on matrices. Note that this is distinct from simply element-wise application of the function to the matrix's elements. The Matrix Exponential `exp(::Matrix)`

is a particularly important matrix function.

Al-Mohy and Higham (2009)^{[7]}, published a method for computing the pullback of `exp(::Matrix)`

. It is pretty complex and very cool. To quote its abstract (emphasis mine):

The algorithm is derived from the scaling and squaring method by differentiating the Padé approximants and the squaring recurrence,

re-using quantities computed during the evaluation of the Padé approximant, and intertwining the recurrences in the squaring phase.

Julia does in fact use a Padé approximation to compute `exp(::Matrix)`

. So we can extract the code for that into our augmented primal, and add remembering the intermediate quantities that are to be used. See the code for this in ChainRules.jl

An interesting scenario here that may be of concern to some: if Julia changes the algorithm it uses to compute `exp(::Matrix)`

, then during an AD primal pass, it will continue to use the old Padé approximation based algorithm. This may actually happen, as there are many other algorithms that can compute the matrix exponential. Further, perhaps there might be an improvement to the exact coefficient or cut-offs used by Julia's current Padé approximation. If Julia made this change it wopuld not be considered breaking. Exact floating point numerical values are not generally considered part of the SemVer-bound API. Rather only the general accuracy of the computed value relative to the true mathematical value (e.g. for common scalar operations Julia promises 1 ULP).

This change will result in the output of the AD primal pass not being exactly equal to what would be seen from just running the primal code. It will still be accurate because the current implementation is accurate, but it will be different. It is our argument that in general this should be considered acceptable, as long as the AD primal pass is in general about as accurate as the unaugmented primal. E.g. it might overshoot for some values the unaugmented primal undershoots for.

`eigvals`

`eigvals`

is a real case where the algorithm for the augmented primal and the original primal *is already different today*. To compute the pullback of `eigvals`

you need to know not only the eigenvalues but also the eigenvectors. The `eigen`

function computes both, so that is used in the augmented primal. See the code for this in ChainRules.jl. If we could not compute and remember the eigenvectors in the primal pass, we would have to call `eigen`

in the gradient pass anyway and fully recompute eigenvectors and eigenvalues, more than doubling the total work.

However, if you trace this down, it actually uses a different algorithm.

`eigvals`

basically wraps `LAPACK.syevr!('N', ...)`

, which goes through DSYEVR and eventually calls DSTERF, which uses *"Pal-Walker-Kahan variant of the QL or QR algorithm."* to compute eigenvalues

In contrast, `eigen`

wraps `LAPACK.syevr!('V',...)`

which also goes through DSYEVR but eventually calls DSTEMR, which calculates eigenvalues *"either by bisection or the dqds algorithm."*.

Both of these are very good algorithms. LAPACK has had decades of work by experts and is one of the most trusted libraries for linear algebra. But they are different algorithms that give different results. The differences in practice are around $10^{-15}$, which while very small on absolute terms are as far as `Float64`

is concerned a very real difference.

### Matrix Division

Roughly speaking: `Y=A\B`

is the function that finds the least-square solution to `YA ≈ B`

. When solving such a system, the efficient way to do so is to factorize `A`

into an appropriate factorized form such as `Cholesky`

or `QR`

, then perform the `\`

operation on the factorized form. The pullback of `A\B`

with respect to `B`

is `Ȳ -> A' \ Ȳ`

. It should be noted that this involves computing the factorization of `A'`

(the adjoint of `A`

).^{[8]} In this computation the factorization of the original `A`

can reused. Doing so can give a 4x speed-up.

We don't have this in ChainRules.jl yet, because Julia is missing some definitions of `adjoint`

of factorizations (JuliaLang/julia#38293).^{[8]} We have been promised them for Julia v1.7 though. You can see what the code would look like in PR #302.

## Conclusion

This document has explained why `rrule`

is the way it is. In particular it has highlighted why the primal computation is able to be changed from simply calling the function. Further, it has explained why `rrule`

returns a closure for the pullback, rather than it being a separate function. It has highlighted several places in ChainRules.jl where this has allowed us to significantly improve performance. Being able to change the primal computation is practically essential for a high performance AD system.

- 1I am not just picking on Nabla randomly. Many of the core developers of ChainRules worked on Nabla prior. It's a good AD, but ChainRules incorporates lessons learned from working on Nabla.
- 2which may be an explicit tape or an implicit tape that is actually incorporated into generated code (à la Zygote)
- 3Another key reason is if the operation is a primitive that is not defined in terms of more basic operations. In many languages this is the case for
`sin`

; where the actual implementation is in some separate`libm.so`

. But actually`sin`

in Julia is defined in terms of a polynomial. It's fairly vanilla Julia code. It shouldn't be too hard for an AD that only knows about basic operations like`+`

and`*`

to AD through it. In any case, that is another discussion for another day. - 4Sure, this is small fries and depending on Julia version might just get solved by the optimizer
^{[5]}, but go with it for the sake of example. - 5To be precise, this is very likely to be solved by the optimizer inlining both and then performing common subexpression elimination, with the result that it generates the code for
`sincos`

just from having`sin`

and`cos`

inside the same function. However, this actually doesn't apply in the case of AD, as it is not possible to inline code called in the gradient pass into the primal pass. Those are separate functions called at very different times. This is something opaque closures should help solve. - 6One change we might consider is to have logistic sigmoid to only remember one thing. Rather than remembering
`y`

and`ex`

to use in the pullback, we could compute`y / (1 + ex)`

during the augmented primal, and just remember that. - 7Al-Mohy, Awad H. and Higham, Nicholas J. (2009)
*Computing the Fréchet Derivative of the Matrix Exponential, with an application to Condition Number Estimation*. SIAM Journal On Matrix Analysis and Applications., 30 (4). pp. 1639-1657. ISSN 1095-7162 - 8To be clear here we mean
`adjoint`

as in the conjugate transpose of a matrix, rather than in the sense of reverse mode AD.