Advanced tutorial
We present contexts and sparsity handling with DifferentiationInterface.jl.
using ADTypes
using BenchmarkTools
using DifferentiationInterface
using ForwardDiff: ForwardDiff
using Zygote: Zygote
using Random
using SparseConnectivityTracer
using SparseMatrixColorings
Contexts
Assume you want to differentiate a multi-argument function with respect to its first argument.
f_multiarg(x, c) = c * sum(abs2, x)
The first way, which works with every backend, is to create a closure:
f_singlearg(c) = x -> f_multiarg(x, c)
Let's see it in action:
backend = AutoForwardDiff()
x = float.(1:3)
gradient(f_singlearg(10), backend, x)
3-element Vector{Float64}:
20.0
40.0
60.0
However, for performance reasons, it is sometimes preferable to avoid closures and pass all arguments to the original function. We can do this by wrapping c in a Constant and giving this constant to the gradient operator.
gradient(f_multiarg, backend, x, Constant(10))
3-element Vector{Float64}:
20.0
40.0
60.0
Preparation also works in this case, even if the constant changes before execution:
prep_other_constant = prepare_gradient(f_multiarg, backend, x, Constant(-1))
gradient(f_multiarg, prep_other_constant, backend, x, Constant(10))
3-element Vector{Float64}:
20.0
40.0
60.0
For additional arguments which act as mutated buffers, the Cache wrapper is the appropriate choice instead of Constant.
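As a rough sketch (the function f_buffered and its buffer buf are made up for illustration, reusing backend and x from above), a mutated scratch array would be passed like this:
# Hypothetical function: buf is scratch space that gets overwritten on every call,
# so it must be wrapped in Cache rather than Constant.
function f_buffered(x, buf)
    buf .= abs2.(x)  # mutate the buffer
    return sum(buf)
end
buf = similar(x)
gradient(f_buffered, backend, x, Cache(buf))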
Sparsity
If you use DifferentiationInterface's sparse AD functionality in your research, please cite our preprint Sparser, Better, Faster, Stronger: Efficient Automatic Differentiation for Sparse Jacobians and Hessians.
Sparse AD is very useful when Jacobian or Hessian matrices have a lot of zeros. So let us write functions that satisfy this property.
f_sparse_vector(x::AbstractVector) = diff(x .^ 2) + diff(reverse(x .^ 2))
f_sparse_scalar(x::AbstractVector) = sum(f_sparse_vector(x) .^ 2)
Dense backends
When we use the jacobian or hessian operator with a dense backend, we get a dense matrix with plenty of zeros.
x = float.(1:8)
8-element Vector{Float64}:
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
dense_forward_backend = AutoForwardDiff()
J_dense = jacobian(f_sparse_vector, dense_forward_backend, x)
7×8 Matrix{Float64}:
-2.0 4.0 0.0 0.0 0.0 0.0 14.0 -16.0
0.0 -4.0 6.0 0.0 0.0 12.0 -14.0 0.0
0.0 0.0 -6.0 8.0 10.0 -12.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 6.0 -8.0 -10.0 12.0 0.0 0.0
0.0 4.0 -6.0 0.0 0.0 -12.0 14.0 0.0
2.0 -4.0 0.0 0.0 0.0 0.0 -14.0 16.0
dense_second_order_backend = SecondOrder(AutoForwardDiff(), AutoZygote())
H_dense = hessian(f_sparse_scalar, dense_second_order_backend, x)
8×8 Matrix{Float64}:
112.0 -32.0 0.0 0.0 0.0 0.0 -112.0 128.0
-32.0 96.0 -96.0 0.0 0.0 -192.0 448.0 -256.0
0.0 -96.0 256.0 -192.0 -240.0 576.0 -336.0 0.0
0.0 0.0 -192.0 224.0 320.0 -384.0 0.0 0.0
0.0 0.0 -240.0 320.0 368.0 -480.0 0.0 0.0
0.0 -192.0 576.0 -384.0 -480.0 1120.0 -672.0 0.0
-112.0 448.0 -336.0 0.0 0.0 -672.0 1536.0 -896.0
128.0 -256.0 0.0 0.0 0.0 0.0 -896.0 1120.0
The results are correct, but the procedure is very slow. By using a sparse backend, we can make the runtime scale with the number of nonzero elements instead of the total number of elements.
Sparse backends
To create a sparse backend, combine a dense backend, a sparsity detector and a compatible coloring algorithm inside AutoSparse. The following are reasonable defaults:
sparse_forward_backend = AutoSparse(
dense_forward_backend; # any object from ADTypes
sparsity_detector=TracerSparsityDetector(),
coloring_algorithm=GreedyColoringAlgorithm(),
)
sparse_second_order_backend = AutoSparse(
dense_second_order_backend; # any object from ADTypes or a SecondOrder from DI
sparsity_detector=TracerSparsityDetector(),
coloring_algorithm=GreedyColoringAlgorithm(),
)
Now the resulting matrices are sparse:
jacobian(f_sparse_vector, sparse_forward_backend, x)
7×8 SparseArrays.SparseMatrixCSC{Float64, Int64} with 26 stored entries:
-2.0 4.0 ⋅ ⋅ ⋅ ⋅ 14.0 -16.0
⋅ -4.0 6.0 ⋅ ⋅ 12.0 -14.0 ⋅
⋅ ⋅ -6.0 8.0 10.0 -12.0 ⋅ ⋅
⋅ ⋅ ⋅ 0.0 0.0 ⋅ ⋅ ⋅
⋅ ⋅ 6.0 -8.0 -10.0 12.0 ⋅ ⋅
⋅ 4.0 -6.0 ⋅ ⋅ -12.0 14.0 ⋅
2.0 -4.0 ⋅ ⋅ ⋅ ⋅ -14.0 16.0
hessian(f_sparse_scalar, sparse_second_order_backend, x)
8×8 SparseArrays.SparseMatrixCSC{Float64, Int64} with 40 stored entries:
112.0 -32.0 ⋅ ⋅ ⋅ ⋅ -112.0 128.0
-32.0 96.0 -96.0 ⋅ ⋅ -192.0 448.0 -256.0
⋅ -96.0 256.0 -192.0 -240.0 576.0 -336.0 ⋅
⋅ ⋅ -192.0 224.0 320.0 -384.0 ⋅ ⋅
⋅ ⋅ -240.0 320.0 368.0 -480.0 ⋅ ⋅
⋅ -192.0 576.0 -384.0 -480.0 1120.0 -672.0 ⋅
-112.0 448.0 -336.0 ⋅ ⋅ -672.0 1536.0 -896.0
128.0 -256.0 ⋅ ⋅ ⋅ ⋅ -896.0 1120.0
Sparse preparation
In the examples above, we didn't use preparation. Sparse preparation is more costly than dense preparation, but it is even more essential. Indeed, once preparation is done, sparse differentiation is much faster than dense differentiation, because it makes fewer calls to the underlying function.
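For example (a small sketch reusing the objects defined above), the preparation result is computed once and can then be reused for subsequent inputs of the same size and type:
prep = prepare_jacobian(f_sparse_vector, sparse_forward_backend, x)  # sparsity detection + coloring happen here
jacobian(f_sparse_vector, prep, sparse_forward_backend, x)           # fast: one pushforward per color group
jacobian(f_sparse_vector, prep, sparse_forward_backend, 2 .* x)      # same prep, different input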
Some result analysis functions from SparseMatrixColorings.jl can help you figure out what the preparation contains. First, it records the sparsity pattern itself (the one returned by the detector).
jac_prep = prepare_jacobian(f_sparse_vector, sparse_forward_backend, x)
sparsity_pattern(jac_prep)
7×8 SparseArrays.SparseMatrixCSC{Bool, Int64} with 26 stored entries:
1 1 ⋅ ⋅ ⋅ ⋅ 1 1
⋅ 1 1 ⋅ ⋅ 1 1 ⋅
⋅ ⋅ 1 1 1 1 ⋅ ⋅
⋅ ⋅ ⋅ 1 1 ⋅ ⋅ ⋅
⋅ ⋅ 1 1 1 1 ⋅ ⋅
⋅ 1 1 ⋅ ⋅ 1 1 ⋅
1 1 ⋅ ⋅ ⋅ ⋅ 1 1
In forward mode, each column of the sparsity pattern gets a color.
column_colors(jac_prep)
8-element Vector{Int64}:
1
2
1
2
3
4
3
4
And the colors in turn define non-overlapping groups (for Jacobians at least, Hessians are a bit more complicated).
column_groups(jac_prep)
4-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
[1, 3]
[2, 4]
[5, 7]
[6, 8]
Sparsity speedup
When preparation is used, the speedup due to sparsity becomes very visible in large dimensions.
xbig = rand(1000)
jac_prep_dense = prepare_jacobian(f_sparse_vector, dense_forward_backend, zero(xbig))
@benchmark jacobian($f_sparse_vector, $jac_prep_dense, $dense_forward_backend, $xbig)
BenchmarkTools.Trial: 453 samples with 1 evaluation per sample.
Range (min … max): 5.076 ms … 253.044 ms ┊ GC (min … max): 0.00% … 97.78%
Time (median): 5.716 ms ┊ GC (median): 10.91%
Time (mean ± σ): 11.552 ms ± 35.003 ms ┊ GC (mean ± σ): 54.94% ± 18.16%
Memory estimate: 57.63 MiB, allocs estimate: 1516.
jac_prep_sparse = prepare_jacobian(f_sparse_vector, sparse_forward_backend, zero(xbig))
@benchmark jacobian($f_sparse_vector, $jac_prep_sparse, $sparse_forward_backend, $xbig)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 23.003 μs … 1.583 ms ┊ GC (min … max): 0.00% … 93.28%
Time (median): 27.431 μs ┊ GC (median): 0.00%
Time (mean ± σ): 31.651 μs ± 54.817 μs ┊ GC (mean ± σ): 11.45% ± 6.70%
Memory estimate: 305.38 KiB, allocs estimate: 27.
Better memory use can be achieved by pre-allocating the matrix from the preparation result (so that it has the correct structure).
jac_buffer = similar(sparsity_pattern(jac_prep_sparse), eltype(xbig))
@benchmark jacobian!(
$f_sparse_vector, $jac_buffer, $jac_prep_sparse, $sparse_forward_backend, $xbig
)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 19.928 μs … 1.266 ms ┊ GC (min … max): 0.00% … 93.21%
Time (median): 23.384 μs ┊ GC (median): 0.00%
Time (mean ± σ): 26.076 μs ± 39.710 μs ┊ GC (mean ± σ): 8.36% ± 5.46%
Memory estimate: 234.80 KiB, allocs estimate: 18.
And for optimal speed, one should write non-allocating and type-stable functions.
function f_sparse_vector!(y::AbstractVector, x::AbstractVector)
n = length(x)
for i in eachindex(y)
y[i] = abs2(x[i + 1]) - abs2(x[i]) + abs2(x[n - i]) - abs2(x[n - i + 1])
end
return nothing
end
ybig = zeros(length(xbig) - 1)
f_sparse_vector!(ybig, xbig)
ybig ≈ f_sparse_vector(xbig)
true
In this case, the sparse Jacobian should also become non-allocating (for our specific choice of backend).
jac_prep_sparse_nonallocating = prepare_jacobian(
f_sparse_vector!, zero(ybig), sparse_forward_backend, zero(xbig)
)
jac_buffer = similar(sparsity_pattern(jac_prep_sparse_nonallocating), eltype(xbig))
@benchmark jacobian!(
$f_sparse_vector!,
$ybig,
$jac_buffer,
$jac_prep_sparse_nonallocating,
$sparse_forward_backend,
$xbig,
)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 12.252 μs … 42.179 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.503 μs ┊ GC (median): 0.00%
Time (mean ± σ): 12.675 μs ± 1.201 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
Mixed mode
Some Jacobians have a structure which includes dense rows and dense columns, like this one:
arrowhead(x) = x .+ x[1] .+ vcat(sum(x), zeros(eltype(x), length(x)-1))
jacobian_sparsity(arrowhead, x, TracerSparsityDetector())
8×8 SparseArrays.SparseMatrixCSC{Bool, Int64} with 22 stored entries:
1 1 1 1 1 1 1 1
1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅
1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅
1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅
1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅
1 ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅
1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1
In such cases, sparse AD is only beneficial in "mixed mode", where we combine a forward and a reverse backend. This is achieved using the MixedMode wrapper, for which we recommend a random coloring order (see RandomOrder):
sparse_mixed_backend = AutoSparse(
MixedMode(AutoForwardDiff(), AutoZygote());
sparsity_detector=TracerSparsityDetector(),
coloring_algorithm=GreedyColoringAlgorithm(RandomOrder(MersenneTwister(), 0)),
)
AutoSparse(dense_ad=MixedMode{AutoForwardDiff{nothing, Nothing}, AutoZygote}(AutoForwardDiff(), AutoZygote()), sparsity_detector=SparseConnectivityTracer.TracerSparsityDetector(), coloring_algorithm=SparseMatrixColorings.GreedyColoringAlgorithm{:direct, 1, Tuple{SparseMatrixColorings.RandomOrder{Random.MersenneTwister, Int64}}}((SparseMatrixColorings.RandomOrder{Random.MersenneTwister, Int64}(Random.MersenneTwister(0x50f8267f52084f00ac3f46c6987e3a43), 0),), false))
It unlocks a large speedup compared to pure forward mode, and the same would be true compared to reverse mode:
@benchmark jacobian($arrowhead, prep, $sparse_forward_backend, $xbig) setup=(
prep=prepare_jacobian(arrowhead, sparse_forward_backend, xbig)
)
BenchmarkTools.Trial: 396 samples with 1 evaluation per sample.
Range (min … max): 4.671 ms … 245.490 ms ┊ GC (min … max): 0.00% … 97.86%
Time (median): 5.068 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.719 ms ± 26.769 ms ┊ GC (mean ± σ): 43.63% ± 15.91%
Memory estimate: 25.06 MiB, allocs estimate: 683.
@benchmark jacobian($arrowhead, prep, $sparse_mixed_backend, $xbig) setup=(
prep=prepare_jacobian(arrowhead, sparse_mixed_backend, xbig)
)
BenchmarkTools.Trial: 2756 samples with 1 evaluation per sample.
Range (min … max): 41.598 μs … 8.607 ms ┊ GC (min … max): 0.00% … 98.55%
Time (median): 50.425 μs ┊ GC (median): 0.00%
Time (mean ± σ): 77.630 μs ± 300.825 μs ┊ GC (mean ± σ): 34.09% ± 10.28%
Memory estimate: 275.08 KiB, allocs estimate: 70.
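For completeness, here is a sketch (not benchmarked above) of the pure reverse-mode sparse backend one could compare against; it colors rows instead of columns, so the dense first column of the arrowhead pattern would force every row into its own group:
sparse_reverse_backend = AutoSparse(
    AutoZygote();  # reverse-mode dense backend
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)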