# ProbabilityModels

16

2

0

3

### contributors

This is alpha-quality software. It is under active development. Optimistically, I hope to have it and its dependencies reasonably well documented and tested, and all the libraries registered, by the end of the year. There is a roadmap issue here.

The primary goal of this library is to make it as easy as possible to specify models that run as quickly as possible \$-\$ providing both log densities and the associated gradients. This allows the library to be a front end to Hamiltonian Monte Carlo backends, such as DynamicHMC.jl. AdvancedHMC.jl/Turing's NUTS is also probably worth looking into, as it reportedly converges the most reliably, at least within DiffEqBayes.

A brief introduction.

First, you specify a model using a DSL:

``````using PaddedMatrices, StructuredMatrices, DistributionParameters, LoopVectorization
using InplaceDHMC, VectorizedRNG
using Random, SpecialFunctions, MCMCChainSummaries, LinearAlgebra
using ProbabilityModels, LoopVectorization, SLEEFPirates, SIMDPirates, ProbabilityDistributions, PaddedMatrices
using ProbabilityModels: HierarchicalCentering, ∂HierarchicalCentering, ITPExpectedValue, ∂ITPExpectedValue

@model ITPModel begin
# Non-hierarchical Priors
ρ ~ Beta(3, 1)
lκ ~ lsgg(8.5, 1.5, 3.0, 1.5271796258079011) # μ = 1, σ² = 10
σ ~ Gamma(1.5, 0.25) # μ = 6, σ² = 2.4
θ ~ Normal(10)
L ~ LKJ(2.0)
# Hierarchical Priors.
# h subscript, for highest in the hierarhcy.
μₕ₁ ~ Normal(10) # μ = 0
μₕ₂ ~ Normal(10) # μ = 0
σₕ ~ Normal(10) # μ = 0
# Raw μs; non-cenetered parameterization
μᵣ₁ ~ Normal() # μ = 0, σ = 1
μᵣ₂ ~ Normal() # μ = 0, σ = 1
# Center the μs
μᵦ₁ = HierarchicalCentering(μᵣ₁, μₕ₁, σₕ)
μᵦ₂ = HierarchicalCentering(μᵣ₂, μₕ₂, σₕ)
σᵦ ~ Normal(10) # μ = 0
# Raw βs; non-cenetered parameterization
βᵣ₁ ~ Normal()
βᵣ₂ ~ Normal()
# Center the βs.
β₁ = HierarchicalCentering(βᵣ₁, μᵦ₁, σᵦ, domains)
β₂ = HierarchicalCentering(βᵣ₂, μᵦ₂, σᵦ, domains)
# Likelihood
κ = vexp(lκ)
μ₁ = vec(ITPExpectedValue(time, β₁, κ, θ))
μ₂ = vec(ITPExpectedValue(time, β₂, κ, θ))
Σ = CovarianceMatrix(ρ, Diagonal(σ) * L, time)
# Tuple (Y₁, Y₂)
(Y₁, Y₂) ~ Normal((μ₁, μ₂)[AvailableData], Σ[AvailableData])
end
# Defined model: ITPModel.
# Unknowns: Y₂, domains, μₕ₂, μᵣ₁, time, σ, AvailableData, σᵦ, θ, μᵣ₂, ρ, σₕ, lκ, μₕ₁, L, Y₁, βᵣ₂, βᵣ₁.
``````

The `@model` macro uses the following expression to define a struct and several functions. The struct has one field for each unknown.

We can then define an instance of the struct, where we specify each of these unknowns either with an instance (e.g. assign a piece of data), or with a type. Those specified with a type are unknown parameters (of that type); those specified with an instance treat that instance as known priors or data. For example, (if we had the appropriate functions defined in scope), we could create an instance with:

``````K = 7; D = 3
data = ITPModel(
domains = domains,
AvailableData = missingness,
Y₁ = Y₁,
Y₂ = Y₂,
time = time_vector,
L = LKJCorrCholesky{K},
ρ = RealVector{K,Bounds(0,1)},
lκ = RealVector{K},
θ = RealVector{K},
μₕ₁ = RealFloat,
μₕ₂ = RealFloat,
μᵣ₁ = RealVector{D},
μᵣ₂ = RealVector{D},
βᵣ₁ = RealVector{K},
βᵣ₂ = RealVector{K},
σᵦ = RealFloat{Bounds(0,Inf)},
σₕ = RealFloat{Bounds(0,Inf)},
σ = RealVector{K,Bounds(0,Inf)}
)
``````

This would let `Y₁` equal the variable `Y₁`, while `ρ` is a vector (with length `K=7`) of unknown parameters bounded between 0 and 1. The default bounds are `(-Inf,Inf)`.

Before spewing boilerplate to generate random true values and fake data, a brief summary of the model: We have longitudinal multivariate observations for some number of subjects. However, not all observations are measured at all times. That is, while subjects may be measured at multiple times (I use \$T=36\$ times below), only some measurements are taken at any given time, yielding missing data. Therefore, we subset the full covariance matrix (produced from a vector of autocorrelations, and the Cholesky factor of a covariance matrix across measurements) to find the marginal.

The expected value is function of time (`ITPExpectedValue`). This function returns a matrix (`time x measurement`), so we `vec` it and subset it.

We also expect some measurements to bare more in common than others, so we group them into "domains", and provide hierarchical priors. We use a non-cenetered parameterization, and the function `HierarchicalCentering` then centers our parameters for us. There are two methods we use above: one takes scalars, to transform a vector. The other accepts different domain means and standard deviations, and uses these to transform a vector, taking the indices from the `Domains` argument. That is, if the first element of `Domains` is 2, indicating that the first 2 measurements belong to the first domain, it will transform the first two elements of `βᵣ₁` with the first element of `μᵦ₁` and `σᵦ` (if either `μᵦ₁` or `σᵦ` are scalars, they will be broadcasted across each domain).

``````function rinvscaledgamma(::Val{N},a::T,b::T,c::T) where {N,T}
rg = MutableFixedSizeVector{N,T}(undef)
log100 = log(100)
@inbounds for n ∈ 1:N
rg[n] = log100 / exp(log(VectorizedRNG.randgamma(a/c)) / c + b)
end
rg
end

const domains = ProbabilityModels.Domains(2,2,3)

const n_endpoints = sum(domains)

const times = MutableFixedSizeVector{36,Float64,36}(undef); times .= 0:35;

structured_missing_pattern = push!(vcat(([1,0,0,0,0] for i ∈ 1:7)...), 1);
missing_pattern = vcat(
structured_missing_pattern, fill(1, 4length(times)), structured_missing_pattern, structured_missing_pattern
);

const availabledata = MissingDataVector{Float64}(missing_pattern);

const κ₀ = (8.5, 1.5, 3.0)

AR1(ρ, t) = @. ρ ^ abs(t - t')

function generate_true_parameters(domains, times, κ₀)
K = sum(domains)
D = length(domains)
T = length(times)
μ = MutableFixedSizeVector{K,Float64}(undef)
offset = 0
for i ∈ domains
domain_mean = 5randn()
for j ∈ 1+offset:i+offset
μ[j] = domain_mean + 5randn()
end
offset += i
end
σ = VectorizedRNG.randgamma( 6.0, 1/6.0)
ρ = VectorizedRNG.randbeta(4.0,4.0)
κ = rinvscaledgamma(Val(K), κ₀...)
lt = last(times)
β = MutableFixedSizeVector{7,Float64,7}(( 0.0625,  0.0575,  0.0525,  0.0475,  0.0425,  0.04,  0.0375))
θ₁ = @. μ' - β' * ( 1.0 - exp( - κ' * times) ) / (1.0 - exp( - κ' * lt) )
θ₂ = @. μ' + β' * ( 1.0 - exp( - κ' * times) ) / (1.0 - exp( - κ' * lt) )

L_T, info = LAPACK.potrf!('L', AR1(ρ, times))
@inbounds for tc ∈ 2:T, tr ∈ 1:tc-1
L_T[tr,tc] = 0.0
end
U_K, info = LAPACK.potrf!('U', BLAS.syrk!('U', 'N', σ, X, 0.0, zero(MutableFixedSizeMatrix{K,K,Float64,K})))
(
U_K = U_K, L_T = L_T, μ = μ, θ₁ = θ₁, θ₂ = θ₂, domains = domains, time = times
)
end

@generated function randomize!(
A::AbstractArray{T,P},
) where {M,N,T,P}
quote
@boundscheck begin
d = size(A,1)
for p ∈ 2:\$(P-1)
d *= size(A,p)
end
d == M*N || PaddedMatrices.ThrowBoundsError("Earlier dims size(A) == \$(size(A)) does not match size(D) == (\$M,\$N)")
end
ptr = pointer(sp, \$T)
E = PtrMatrix{\$M,\$N,\$T,\$M}( ptr )
F = PtrMatrix{\$M,\$N,\$T,\$M}( ptr + \$(sizeof(T) * M * N) )
ptr_A = pointer(A)
GC.@preserve A begin
for n ∈ 0:size(A,\$P)-1
Aₙ = PtrMatrix{\$M,\$N,\$T,\$M}( ptr_A + n*\$(sizeof(T)*M*N) )
randn!(ProbabilityModels.GLOBAL_PCGs[1], E)
mul!(F, B, E)
Aₙ .= D
end
end
sp
end
end

sample_data( N, truth, missingness, missingvals = (Val{0}(),Val{0}()) ) = sample_data( N, truth, missingness, missingvals, truth.domains )
@generated function sample_data(
N::Tuple{Int,Int},
truth, missingness,
::Tuple{Val{M1},Val{M2}},
::ProbabilityModels.Domains{S}
) where {S,M1,M2}
K = sum(S)
D = length(S)
quote
N₁, N₂ = N
L_T = truth.L_T
U_K = truth.U_K
T = size(L_T,1)

sp = ProbabilityModels.STACK_POINTER_REF[]
(sp,Y₁) = PaddedMatrices.DynamicPtrArray{Float64,3}(sp, (T, \$K, N₁), T)
(sp,Y₂) = PaddedMatrices.DynamicPtrArray{Float64,3}(sp, (T, \$K, N₂), T)
randomize!(sp, Y₁, L_T, U_K, truth.θ₁)
randomize!(sp, Y₂, L_T, U_K, truth.θ₂)

c = length(missingness.indices)
inds = missingness.indices

Y₁sub = reshape(Y₁, (T * \$K, N₁))[inds, :]
\$(M1 > 0 ? quote
Y₁union = Array{Union{Missing,Float64}}(Y₁sub)
perm = randperm(length(Y₁sub))
@inbounds for m in 1:\$M1
Y₁union[perm[m]] = Base.missing
end
Y₁ = convert(MissingDataArray{\$M1,Bounds(-Inf,Inf)}, Y₁union)
end : quote
Y₁ = Y₁sub
end)

Y₂sub = reshape(Y₂, (T * \$K, N₂))[inds, :]
\$(M2 > 0 ? quote
Y₂union = Array{Union{Missing,Float64}}(Y₂sub)
perm = randperm(length(Y₂sub))
@inbounds for m in 1:\$M2
Y₂union[perm[m]] = Base.missing
end
Y₂ = convert(MissingDataArray{\$M2,Bounds(-Inf,Inf)},Y₂union)
end : quote
Y₂ = Y₂sub
end)

ITPModel(
domains = truth.domains,
AvailableData = missingness,
Y₁ = Y₁,
Y₂ = Y₂,
time = truth.time,
L = LKJCorrCholesky{\$K},
ρ = RealVector{\$K,Bounds(0,1)},
lκ = RealVector{\$K},
θ = RealVector{\$K},
μₕ₁ = RealFloat,
μₕ₂ = RealFloat,
μᵣ₁ = RealVector{\$D},
μᵣ₂ = RealVector{\$D},
βᵣ₁ = RealVector{\$K},
βᵣ₂ = RealVector{\$K},
σᵦ = RealFloat{Bounds(0,Inf)},
σₕ = RealFloat{Bounds(0,Inf)},
σ = RealVector{\$K,Bounds(0,Inf)}
)
end
end
``````

The library DistributionParameters.jl provides a variety of parameter types. These types define constrianing transformations, to transform an unconstrained parameter vector and add the appropriate jacobians.

All parameters are typed by size. The library currently provides a DynamicHMC interface, defining `logdensity(::Value,::ITPModel)` and `logdensity(::ValueGradient,::ITPModel)` methods.

``````
``````

For comparison, a Stan implementation of this model takes about to 13ms to evaluate the gradient.

The `@model` macro also defines a helper function for constraining unconstrained parameter vectors:

``````
``````

So you can use this to constrain the unconstrained parameter vectors `DynamicHMC` sampled and proceed with your convergence assessments and posterior analysis as normal.

Alternatively, it also supports MCMCChainSummaries.jl, constraining the parameters for you and providing posterior summaries as well as plotting methods:

``````
``````

The difference in time between the slowest chain (748 s) and the overall time (775 s) roughly yields the compilation time. If we refit the model, the total time would roughly equal the length of time to sample from the slowest chain.

If you don't pass the sampler a tuner object, it'll create one by default with `M=5` and `term=50`. The section on Automatic Parameter Tuning in the Stan reference manual explains what these mean. I added an extra slow adaptation step (with twice the length of the previous step) and doubled the length of the terminal adaptation window (a final window calculating step size), as this makes adaptation more consistent. The idea is to try and decrease the probability of one or two chains being much slower than the others. We can look at NUTS statistics of the chains:

``````julia> NUTS_statistics.(chains)
18-element Array{DynamicHMC.NUTS_Statistics{Float64,DataStructures.Accumulator{DynamicHMC.Termination,Int64},DataStructures.Accumulator{Int64,Int64}},1}:
Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, min/25%/median/75%/max: 0.0 0.94 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 13% 6 => 87%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 4 => 0% 5 => 1% 6 => 99% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 1% 6 => 99% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.97 0.99 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 0% 6 => 100% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.98, min/25%/median/75%/max: 0.0 0.98 0.99 1.0 1.0
depth: 4 => 0% 5 => 0% 6 => 93% 7 => 7% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.97 0.99 0.99 1.0
depth: 3 => 0% 4 => 0% 5 => 0% 6 => 100% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, min/25%/median/75%/max: 0.0 0.95 0.98 0.99 1.0
depth: 3 => 0% 4 => 0% 5 => 8% 6 => 92% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.95, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 3 => 0% 4 => 0% 5 => 1% 6 => 99% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, min/25%/median/75%/max: 0.0 0.95 0.98 0.99 1.0
depth: 1 => 0% 2 => 0% 3 => 0% 4 => 0% 5 => 6% 6 => 94% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, min/25%/median/75%/max: 0.0 0.94 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 4% 6 => 96% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.91, min/25%/median/75%/max: 0.0 0.92 0.97 0.99 1.0
depth: 1 => 0% 2 => 0% 3 => 0% 4 => 0% 5 => 28% 6 => 71% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 1 => 0% 2 => 0% 3 => 0% 4 => 0% 5 => 1% 6 => 99% 7 => 0% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 3 => 0% 4 => 0% 5 => 1% 6 => 99% 7 => 0% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, min/25%/median/75%/max: 0.0 0.94 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 8% 6 => 92% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.95, min/25%/median/75%/max: 0.0 0.95 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 3% 6 => 97% 7 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.96, min/25%/median/75%/max: 0.0 0.96 0.98 0.99 1.0
depth: 2 => 0% 3 => 0% 4 => 0% 5 => 1% 6 => 99% 7 => 0% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.91, min/25%/median/75%/max: 0.0 0.92 0.97 0.99 1.0
depth: 1 => 0% 2 => 1% 3 => 0% 4 => 1% 5 => 39% 6 => 60% 7 => 0% 8 => 0%

Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.91, min/25%/median/75%/max: 0.0 0.92 0.97 0.99 1.0
depth: 1 => 0% 2 => 1% 3 => 0% 4 => 0% 5 => 20% 6 => 78% 7 => 0%
``````

and see that the fifth chain had an acceptance rate of 0.98, vs the target acceptance rate of `δ = 0.95`. 7% of samples hit a treedepth of 7, meaning this was probably the slowest chain (worker 15). Maybe increasing term further would help.

Overview of how the library works.

The `ITPModel` struct (or whatever you've named your model, with the first argument to the macro) is defined as a struct with a field for each unknown. Each field is parametrically typed.

`logdensity` and `constrain` are defined as generated functions, so that they can compile appropriate code given these parameteric types. The `@model` macro does a little preprocessing of the expression; most of the code generation occurs within these generated functions.

The macro's preprocessing consists of simply translating the sampling statements into log probability increments (`target`, terminology taken from the Stan language), and lowering the expression.

We can manually perform these passes on the expression (note that variables with `#` in their names are NOT comments -- they're hygienic names, guaranteeing all the variables I add don't clash with any of the model's variables):

``````itp_q = quote
# Non-hierarchical Priors
ρ ~ Beta(3, 1)
κ ~ Gamma(0.1, 0.1) # μ = 1, σ² = 10
σ ~ Gamma(1.5, 0.25) # μ = 6, σ² = 2.4
θ ~ Normal(10)
L ~ LKJ(2.0)

# Hierarchical Priors.
# h subscript, for highest in the hierarhcy.
μₕ₁ ~ Normal(10) # μ = 0
μₕ₂ ~ Normal(10) # μ = 0
σₕ ~ Normal(10) # μ = 0
# Raw μs; non-cenetered parameterization
μᵣ₁ ~ Normal() # μ = 0, σ = 1
μᵣ₂ ~ Normal() # μ = 0, σ = 1
# Center the μs
μᵦ₁ = HierarchicalCentering(μᵣ₁, μₕ₁, σₕ)
μᵦ₂ = HierarchicalCentering(μᵣ₂, μₕ₂, σₕ)
σᵦ ~ Normal(10) # μ = 0
# Raw βs; non-cenetered parameterization
βᵣ₁ ~ Normal()
βᵣ₂ ~ Normal()
# Center the βs.
β₁ = HierarchicalCentering(βᵣ₁, μᵦ₁, σᵦ, domains)
β₂ = HierarchicalCentering(βᵣ₂, μᵦ₂, σᵦ, domains)

# Likelihood
μ₁ = vec(ITPExpectedValue(time, β₁, κ, θ))
μ₂ = vec(ITPExpectedValue(time, β₂, κ, θ))
Σ = CovarianceMatrix(ρ, Diagonal(σ) * L, time)

(Y₁, Y₂) ~ Normal((μ₁, μ₂)[AvailableData], Σ[AvailableData])

end

itp_preprocessed = itp_q |> ProbabilityModels.translate_sampling_statements |> ProbabilityModels.flatten_expression;

using MacroTools: striplines

striplines(itp_preprocessed)
``````

This yields the following expression:

``````quote
##SSAValue##1## = Beta(ρ, 3, 1)
target = ##SSAValue##2##
##SSAValue##4## = Gamma(κ, 0.1, 0.1)
target = ##SSAValue##5##
##SSAValue##7## = Gamma(σ, 1.5, 0.25)
target = ##SSAValue##8##
##SSAValue##10## = Normal(θ, 10)
target = ##SSAValue##11##
##SSAValue##13## = LKJ(L, 2.0)
target = ##SSAValue##14##
##SSAValue##16## = Normal(μₕ₁, 10)
target = ##SSAValue##17##
##SSAValue##19## = Normal(μₕ₂, 10)
target = ##SSAValue##20##
##SSAValue##22## = Normal(σₕ, 10)
target = ##SSAValue##23##
##SSAValue##25## = Normal(μᵣ₁)
target = ##SSAValue##26##
##SSAValue##28## = Normal(μᵣ₂)
target = ##SSAValue##29##
##SSAValue##31## = HierarchicalCentering(μᵣ₁, μₕ₁, σₕ)
μᵦ₁ = ##SSAValue##31##
##SSAValue##33## = HierarchicalCentering(μᵣ₂, μₕ₂, σₕ)
μᵦ₂ = ##SSAValue##33##
##SSAValue##35## = Normal(σᵦ, 10)
target = ##SSAValue##36##
##SSAValue##38## = Normal(βᵣ₁)
target = ##SSAValue##39##
##SSAValue##41## = Normal(βᵣ₂)
target = ##SSAValue##42##
##SSAValue##44## = HierarchicalCentering(βᵣ₁, μᵦ₁, σᵦ, domains)
β₁ = ##SSAValue##44##
##SSAValue##46## = HierarchicalCentering(βᵣ₂, μᵦ₂, σᵦ, domains)
β₂ = ##SSAValue##46##
##SSAValue##48## = ITPExpectedValue(time, β₁, κ, θ)
##SSAValue##49## = vec(##SSAValue##48##)
μ₁ = ##SSAValue##49##
##SSAValue##51## = ITPExpectedValue(time, β₂, κ, θ)
##SSAValue##52## = vec(##SSAValue##51##)
μ₂ = ##SSAValue##52##
##SSAValue##54## = Diagonal(σ)
##SSAValue##55## = ##SSAValue##54## * L
##SSAValue##56## = CovarianceMatrix(ρ, ##SSAValue##55##, time)
Σ = ##SSAValue##56##
##SSAValue##58## = Core.tuple(Y₁, Y₂)
##SSAValue##59## = Core.tuple(μ₁, μ₂)
##SSAValue##60## = Base.getindex(##SSAValue##59##, AvailableData)
##SSAValue##61## = Base.getindex(Σ, AvailableData)
##SSAValue##62## = Normal(##SSAValue##58##, ##SSAValue##60##, ##SSAValue##61##)
target = ##SSAValue##63##
end
``````

After translating the sampling statements into function calls, the code is transformed with `Meta.lower`. The resulting `Code.Info` is then transformed back into a Julia expression, as you see above. This has the advantage of flattening the expression, so that there is at most only a single function call per line.

The `logdensity` generated functions are defined with the above expressions assigned to a variable, so that they can apply additional transformations.

I'll focus on the `ValueGradient` method, as this one performs the more interesting transformations. Aside from loading and constraining all the parameters from an input vector, it performs a `reverse_diff!` pass on the expression.

``````forward_pass = quote end
reverse_pass = quote end
tracked_vars = Set([:ρ, :κ, :σ, :θ, :L, :μₕ₁, :μₕ₂, :σₕ, :μᵣ₁, :μᵣ₂, :σᵦ, :βᵣ₁, :βᵣ₂, ]);
ProbabilityModels.reverse_diff_pass!(forward_pass, reverse_pass, itp_preprocessed, tracked_vars);
striplines(forward_pass)
``````

It walks the expression, replacing each function with a function that also returns the adjoint; the forward pass becomes:

``````quote
(##SSAValue##1##, ###adjoint###_##∂##SSAValue##1####∂ρ##) = ProbabilityDistributions.∂Beta(ρ, 3, 1, Val{(true, false, false)}())
target = ##SSAValue##2##
(##SSAValue##4##, ###adjoint###_##∂##SSAValue##4####∂κ##) = ProbabilityDistributions.∂Gamma(κ, 0.1, 0.1, Val{(true, false, false)}())
target = ##SSAValue##5##
(##SSAValue##7##, ###adjoint###_##∂##SSAValue##7####∂σ##) = ProbabilityDistributions.∂Gamma(σ, 1.5, 0.25, Val{(true, false, false)}())
target = ##SSAValue##8##
(##SSAValue##10##, ###adjoint###_##∂##SSAValue##10####∂θ##) = ProbabilityDistributions.∂Normal(θ, 10, Val{(true, false)}())
target = ##SSAValue##11##
(##SSAValue##13##, ###adjoint###_##∂##SSAValue##13####∂L##) = ProbabilityDistributions.∂LKJ(L, 2.0, Val{(true, false)}())
target = ##SSAValue##14##
(##SSAValue##16##, ###adjoint###_##∂##SSAValue##16####∂μₕ₁##) = ProbabilityDistributions.∂Normal(μₕ₁, 10, Val{(true, false)}())
target = ##SSAValue##17##
(##SSAValue##19##, ###adjoint###_##∂##SSAValue##19####∂μₕ₂##) = ProbabilityDistributions.∂Normal(μₕ₂, 10, Val{(true, false)}())
target = ##SSAValue##20##
(##SSAValue##22##, ###adjoint###_##∂##SSAValue##22####∂σₕ##) = ProbabilityDistributions.∂Normal(σₕ, 10, Val{(true, false)}())
target = ##SSAValue##23##
target = ##SSAValue##26##
target = ##SSAValue##29##
μᵦ₁ = ##SSAValue##31##
μᵦ₂ = ##SSAValue##33##
(##SSAValue##35##, ###adjoint###_##∂##SSAValue##35####∂σᵦ##) = ProbabilityDistributions.∂Normal(σᵦ, 10, Val{(true, false)}())
target = ##SSAValue##36##
target = ##SSAValue##39##
target = ##SSAValue##42##
β₁ = ##SSAValue##44##
β₂ = ##SSAValue##46##
μ₁ = ##SSAValue##49##
μ₂ = ##SSAValue##52##
##SSAValue##54## = LinearAlgebra.Diagonal(σ)
##SSAValue##55## = ##SSAValue##54## * L
Σ = ##SSAValue##56##
##SSAValue##58## = Core.tuple(Y₁, Y₂)
##SSAValue##59## = Core.tuple(μ₁, μ₂)
target = ##SSAValue##63##
end
``````

It also prepends the corresponding operations to a second expression for the reverse diff pass:

``````striplines(reverse_pass)
``````

yielding:

``````quote
end
``````

The `RESERVED_INCREMENT_SEED_RESERVED` and `RESERVED_MULTIPLY_SEED_RESERVED` fall back to `mulladd` and `*` methods, respectively. They however allow for different overloads.

The keys to performance are that this source-to-source reverse mode differentation avoids unnecessary overhead, and does not interfere with vectorization, which typical dual-number based approaches, like Stan's var type or ForwardDiff's duals, do. The reverse diff pass uses DiffRules.jl as well as custom derivatives defined for probability distributions and a few other special functions.

By defining high level derivatives for commonly used functions we can optimize them much further than if we relied on autodiff for them. Additionally, we abuse multiple dispatch here: whenevever there is some structure we can exploit in the adjoint, we return a type with `INCREMENT/MULTIPLY` methods defined to exploit it. Examples include BlockDiagonalColumnViews or various Reducer types.

Another optimization is that it performs a stack pointer pass. Supported functions are passed a pointer to a preallocated block of memory. This memory reuse can both give us considerable savings, and allow the memory to stay hot in the cache. PaddedMatrices.jl, aside from providing optimized matrix operations (contrary to the name, it's performance advantage over other libraries is actually greatest when none of them use padding), it also provides `PtrArray` (parameterized by size) and `DynamicPtrArray` (dynamically sized) types for taking advantage of this preallocated memory.

09/27/2018

13 days ago

131 commits