Basic tutorial

We present the main features of DifferentiationInterface.jl.

using DifferentiationInterface

Computing a gradient

A common use case of automatic differentiation (AD) is optimizing real-valued functions with first- or second-order methods. Let's define a simple objective (the squared norm) and a random input vector

f(x) = sum(abs2, x)

f (generic function with 1 method)

x = collect(1.0:5.0)

5-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0
 5.0

To compute its gradient, we need to choose a "backend", i.e. an AD package to call under the hood. Most backend types are defined by ADTypes.jl and re-exported by DifferentiationInterface.jl.

ForwardDiff.jl is very generic and efficient for low-dimensional inputs, so it's a good starting point:

using ForwardDiff: ForwardDiff

backend = AutoForwardDiff()

AutoForwardDiff()

Tip

To avoid name conflicts, load AD packages with import instead of using. Indeed, most AD packages also export operators like gradient and jacobian, but you only want to use the ones from DifferentiationInterface.jl.

Now you can use the following syntax to compute the gradient:

gradient(f, backend, x)

5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0

Was that fast? BenchmarkTools.jl helps you answer that question.

using BenchmarkTools

@benchmark gradient($f, $backend, $x)

BenchmarkTools.Trial: 10000 samples with 218 evaluations per sample.
 Range (min … max):  346.931 ns … 128.245 μs  ┊ GC (min … max):  0.00% … 99.49%
 Time  (median):     535.106 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   528.558 ns ±   2.403 μs  ┊ GC (mean ± σ):  11.45% ±  2.95%

   █▇                                       ▁                    
  ▄██▄▃▂▂▂▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▂▁▂▂▂▃▃▅▇██▆▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  347 ns           Histogram: frequency by time          639 ns <

 Memory estimate: 624 bytes, allocs estimate: 5.

Not bad, but you can do better.

Overwriting a gradient

Since you know how much space your gradient will occupy (the same as your input x), you can pre-allocate that memory and offer it to AD. Some backends get a speed boost from this trick.

grad = similar(x)
gradient!(f, grad, backend, x)
grad  # has been mutated

5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0

The bang indicates that one of the arguments of gradient! might be mutated. More precisely, our convention is that every positional argument between the function and the backend is mutated.

@benchmark gradient!($f, $grad, $backend, $x)

BenchmarkTools.Trial: 10000 samples with 221 evaluations per sample.
 Range (min … max):  339.367 ns … 116.648 μs  ┊ GC (min … max):  0.00% … 99.43%
 Time  (median):     540.014 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   522.260 ns ±   2.178 μs  ┊ GC (mean ± σ):  10.22% ±  2.93%

  ▆█▆▂▁     ▂▂▁                            ▂▃▄▅▆▇▇▅▃▂▁▁▁▂▃▄▃▁   ▂
  ██████▅▄▁▆█████▇▇▅▅▁▅▄▅▃▁▃▁▁▁▄▃▃▁▄▃▄▄▁▃▁▅███████████████████▇ █
  339 ns        Histogram: log(frequency) by time        613 ns <

 Memory estimate: 528 bytes, allocs estimate: 3.

For some reason the in-place version is not much better than your first attempt. However, it makes fewer allocations, thanks to the gradient vector you provided. Don't worry, you can get even more performance.

Preparing for multiple gradients

Internally, ForwardDiff.jl creates some data structures to keep track of things. These objects can be reused between gradient computations, even on different input values. We abstract away the preparation step behind a backend-agnostic syntax:

using Random
typical_x = randn!(similar(x))
prep = prepare_gradient(f, backend, typical_x)

DifferentiationInterfaceForwardDiffExt.ForwardDiffGradientPrep{Tuple{typeof(Main.f), AutoForwardDiff{nothing, Nothing}, Vector{Float64}, Tuple{}}, ForwardDiff.GradientConfig{ForwardDiff.Tag{typeof(Main.f), Float64}, Float64, 5, Vector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(Main.f), Float64}, Float64, 5}}}, Tuple{}}(Val{Tuple{typeof(Main.f), AutoForwardDiff{nothing, Nothing}, Vector{Float64}, Tuple{}}}(), ForwardDiff.GradientConfig{ForwardDiff.Tag{typeof(Main.f), Float64}, Float64, 5, Vector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(Main.f), Float64}, Float64, 5}}}((Partials(1.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 1.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 1.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 1.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 1.0)), ForwardDiff.Dual{ForwardDiff.Tag{typeof(Main.f), Float64}, Float64, 5}[Dual{ForwardDiff.Tag{typeof(Main.f), Float64}}(0.0,0.0,1.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{typeof(Main.f), Float64}}(0.0,0.0,1.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{typeof(Main.f), Float64}}(0.0,0.0,1.0,6.9185164429668e-310,0.0,6.91852028170983e-310), Dual{ForwardDiff.Tag{typeof(Main.f), Float64}}(1.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{typeof(Main.f), Float64}}(1.0,0.0,0.0,0.0,0.0,0.0)]), ())

You don't need to know what this object is, you just need to pass it to the gradient operator. Note that preparation does not depend on the actual components of the vector x, just on its type and size.

You can then reuse the prep for different values of the input.

grad = similar(x)
gradient!(f, grad, prep, backend, x)
grad  # has been mutated

5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0

Warning

Reusing the prep object on inputs of a different type will throw an error. Reusing the prep object on inputs of a different size may either work, fail silently or fail loudly, possibly even crash your REPL. Do not try it.

Preparation makes the gradient computation much faster, and (in this case) allocation-free.

@benchmark gradient!($f, $grad, $prep, $backend, $x)

BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  29.895 ns … 59.880 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.570 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.842 ns ±  1.475 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▅██▆▂                                                ▁▁▁ ▂
  ▃▅██████▇███▆▅▁▃▁▄▅▄▆▆▄▁▃▃▃▄▁▃▃▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▇███ █
  29.9 ns      Histogram: log(frequency) by time      37.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Beware that the prep object is nearly always mutated by differentiation operators, even though it is given as the last positional argument.

Switching backends

The whole point of DifferentiationInterface.jl is that you can easily experiment with different AD solutions. Typically, for gradients, reverse mode AD might be a better fit, so let's try Zygote.jl!

using Zygote: Zygote

backend2 = AutoZygote()

AutoZygote()

Once the backend is created, things run smoothly with exactly the same syntax as before:

gradient(f, backend2, x)

5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0

And you can run the same benchmarks to see what you gained (although such a small input may not be realistic):

prep2 = prepare_gradient(f, backend2, randn!(similar(x)))

@benchmark gradient!($f, $grad, $prep2, $backend2, $x)

BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  26.163 ns …  33.973 μs  ┊ GC (min … max):  0.00% … 99.73%
 Time  (median):     50.405 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   54.110 ns ± 561.166 ns  ┊ GC (mean ± σ):  17.90% ±  1.73%

   █▅                                                           
  ▄██▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▅█▅▅▅█▅▂▁▁▁▁▂▂▂▂▃▃▄▃▂▁▁▁▁▁▁▁▁ ▂
  26.2 ns         Histogram: frequency by time         71.6 ns <

 Memory estimate: 96 bytes, allocs estimate: 2.

In short, DifferentiationInterface.jl allows for easy testing and comparison of AD backends. If you want to go further, check out the documentation of DifferentiationInterfaceTest.jl. This related package provides benchmarking utilities to compare backends and help you select the one that is best suited for your problem.