FAQ
What is up with the different symbols?
Δx, ∂x, dx
ChainRules uses these perhaps atypically, as a notation that is the same across propagators regardless of direction (in contrast, see ẋ and x̄ below).
- Δx is the input to a propagator (i.e. a seed for a pullback, or a perturbation for a pushforward)
- ∂x is the output of a propagator
- dx could be either input or output
dots and bars: $\dot{y} = \dfrac{∂y}{∂x} = \overline{x}$
- v̇ is a derivative of the input moving forward: $v̇ = \frac{∂v}{∂x}$ for input $x$, intermediate value $v$.
- v̄ is a derivative of the output moving backward: $v̄ = \frac{∂y}{∂v}$ for output $y$, intermediate value $v$.
others
- Ω is often used as the return value of the function, especially, but not exclusively, for scalar functions.
- ΔΩ is thus a seed for the pullback.
- ∂Ω is thus the output of a pushforward.
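As a rough illustration of this naming, here is a hand-written pushforward and pullback for sin in plain Julia. The helper names sin_pushforward and sin_pullback are hypothetical and are not part of ChainRules; the sketch only shows which variable carries the Δ and which carries the ∂ in each direction.

```julia
# Hypothetical helpers for illustration only; these are not ChainRules API.

# Forward direction: the perturbation Δx goes in, ∂Ω comes out.
function sin_pushforward(x, Δx)
    Ω = sin(x)
    ∂Ω = cos(x) * Δx
    return Ω, ∂Ω
end

# Reverse direction: the seed ΔΩ goes in, ∂x comes out.
function sin_pullback(x, ΔΩ)
    ∂x = cos(x) * ΔΩ
    return ∂x
end

Ω, ∂Ω = sin_pushforward(0.0, 1.0)
∂x = sin_pullback(0.0, 1.0)
```

In both directions the Δ-prefixed value is what the caller supplies and the ∂-prefixed value is what the propagator hands back.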
Why does rrule return the primal function evaluation?
You might wonder why frule(f, x) returns f(x) and the derivative of f at x, and similarly for rrule returning f(x) and the pullback for f at x. Why not just return the pushforward/pullback, and let the user call f(x) to get the answer separately?
There are three reasons the rules also calculate f(x).
- For some rules an alternative way of calculating f(x) can give the same answer while also generating intermediate values that can be used in the calculations required to propagate the derivative.
- For many rrules the output value is used in the definition of the pullback. For example tan, sigmoid etc.
- For some frules there exists a single, non-separable operation that will compute both derivative and primal result. For example many of the methods for differential equation sensitivity analysis.
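The second reason can be sketched with a hand-written, rrule-style function for the logistic sigmoid. This is plain Julia and assumes nothing from ChainRulesCore; sigmoid and sigmoid_with_pullback are hypothetical names chosen for illustration. The derivative of the sigmoid is Ω * (1 - Ω), so the pullback can reuse the primal output rather than evaluating exp again.

```julia
# Hypothetical names for illustration; this is not the ChainRules definition.
sigmoid(x) = 1 / (1 + exp(-x))

function sigmoid_with_pullback(x)
    Ω = sigmoid(x)                    # primal result, computed once
    # The pullback closes over Ω, so no second call to exp is needed.
    pullback(ΔΩ) = ΔΩ * Ω * (1 - Ω)
    return Ω, pullback
end

Ω, pullback = sigmoid_with_pullback(0.0)
∂x = pullback(1.0)
```

If the rule returned only the pullback, a caller who also wanted the primal value would have to compute sigmoid(x) a second time.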
Where are the derivatives for keyword arguments?
Pullbacks do not return a sensitivity for keyword arguments; similarly, pushforwards do not accept a perturbation for keyword arguments. This is because in practice functions are very rarely differentiable with respect to keyword arguments. As a rule, keyword arguments tend either to control side-effects, like logging verbosity, or to change functionality so that a different operation is performed, e.g. dims=3, and are thus not differentiable. To the best of our knowledge, no Julia AD system with support for defining custom primitives supports differentiating with respect to keyword arguments. At some point in the future ChainRules may support these. Maybe.
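A minimal sketch of what this looks like, written in plain Julia with a hypothetical function name (sum_with_pullback is not a ChainRules function). The keyword argument dims is accepted by the primal computation, but the pullback has no slot for it. Real ChainRules pullbacks also return a tangent for the function itself, which this sketch omits.

```julia
# Hypothetical rrule-style function for illustration only.
function sum_with_pullback(x::AbstractArray; dims=:)
    Ω = sum(x; dims=dims)
    # One tangent per positional argument; nothing is returned for `dims`.
    pullback(ΔΩ) = (ΔΩ .* ones(eltype(x), size(x)),)
    return Ω, pullback
end

x = [1.0 2.0; 3.0 4.0]
Ω, pullback = sum_with_pullback(x; dims=1)
(∂x,) = pullback([1.0 10.0])
```

Here dims selects which sum is computed, so it changes the operation rather than being a quantity one could perturb.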
What is the difference between ZeroTangent and NoTangent?
ZeroTangent and NoTangent act almost exactly the same in practice: they result in no change whenever added to anything. Odds are that if you write a rule that returns the wrong one, everything will just work fine. We provide both to allow for clearer writing of rules, and easier debugging.
ZeroTangent() represents the fact that if one perturbs (adds a small change to) the matching primal there will be no change in the behaviour of the primal function. For example, in fst(x, y) = x, the derivative of fst with respect to y is ZeroTangent(): fst(10, 5) == 10, and if we add 0.1 to 5 we still get fst(10, 5.1) == 10.
NoTangent() represents the fact that if one perturbs the matching primal, the primal function will now error. For example, in access(xs, n) = xs[n], the derivative of access with respect to n is NoTangent(): access([10, 20, 30], 2) == 20, but if we add 0.1 to 2 we get access([10, 20, 30], 2.1), which errors as indexing can't be applied at fractional indexes.
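The two examples can be checked directly in plain Julia. This sketch only exercises the primal functions fst and access from the text; it does not construct ZeroTangent() or NoTangent() values, and the variable names unchanged and errored are chosen here for illustration.

```julia
fst(x, y) = x
access(xs, n) = xs[n]

# Perturbing y leaves the result unchanged: this is the ZeroTangent() case.
unchanged = fst(10, 5) == fst(10, 5.1)

# Perturbing n makes the primal function throw: this is the NoTangent() case.
errored = try
    access([10, 20, 30], 2.1)
    false
catch
    true
end
```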
When to use ChainRules vs ChainRulesCore?
ChainRulesCore.jl is a light-weight dependency for defining rules for functions in your packages, without you needing to depend on ChainRules.jl itself. It has almost no dependencies of its own. If you only want to define rules, not use them, then you probably only want to load ChainRulesCore.jl.
ChainRules.jl provides the full functionality for AD systems; in particular, it has all the rules for Base Julia and the standard libraries. It is thus a much heavier package to load. AD systems making use of frules and rrules should load ChainRules.jl.
Where should I put my rules?
We recommend adding custom rules to your own packages with ChainRulesCore.jl, rather than adding them to ChainRules.jl. A few packages, currently SpecialFunctions.jl and NaNMath.jl, have rules in ChainRules.jl as a short-term measure.
How do I test my rules?
You can use ChainRulesTestUtils.jl to test your custom rules. ChainRulesTestUtils.jl has some dependencies, so it is a separate package from ChainRulesCore.jl. This means your package can depend on the light-weight ChainRulesCore.jl, and make ChainRulesTestUtils.jl a test-only dependency.
Remember to read the section On writing good rrule / frule methods.
Where can I learn more about AD?
There are not so many truly excellent learning resources for autodiff out there in the world, which is a bit sad. The list here is incomplete, but is vetted for quality.
Automatic Differentiation for Dummies keynote video by Simon Peyton Jones: particularly good if you like pure math type thinking.
"What types work with differentiation?" comment on DexLang GitHub issue by Dan Zheng: summarizes several years of insights from the Swift AD work.
MIT 18337 lecture notes 8-10 (by Christopher Rackauckas and David P. Sanders): moves fast from basic to advanced, particularly good if you like applicable mathematics
- Automatic Differentiation and Application: Good introduction
- Forward-Mode AD via High Dimensional Algebras: actually part 2 of the introduction
- Solving Stiff Ordinary Differential Equations: ignore the ODE stuff, most of this is about Sparse AutoDiff, can skip/skim this one
- Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems: use in optimization, and details connections to other math.
Diff-Zoo Jupyter Notebook Book (by Mike Innes): has implementations and explanations.
"Evaluating Derivatives" (by Griewank and Walther) is the best book at least for reverse-mode.
It also covers forward-mode, though (by its own admission) not as well; it never mentions dual numbers, which is an unfortunate lack.