Internals (Private)
Octavian._matmul_serial!
Octavian.divide_blocks
Octavian.find_first_acceptable
Octavian.matmul_sizes
Octavian.matmul_st_only_pack_A!
Octavian.reset_bcache_lock!
Octavian.solve_block_sizes
Octavian.split_m
Octavian._matmul_serial!
— Methodmatmul_serial!(C, A, B[, α = 1, β = 0])
Calculates C = α * (A * B) + β * C
in place.
A single threaded matrix-matrix-multiply implementation. Supports dynamically and statically sized arrays.
Organizationally, matmul_serial!
checks the arrays properties to try and dispatch to an appropriate implementation. If the arrays are small and statically sized, it will dispatch to an inlined multiply.
Otherwise, based on the array's size, whether they are transposed, and whether the columns are already aligned, it decides to not pack at all, to pack only A
, or to pack both arrays A
and B
.
Octavian.divide_blocks
— Methoddivide_blocks(M, Ntotal, _nspawn, W)
Splits both M
and N
into blocks when trying to spawn a large number of threads relative to the size of the matrices.
Octavian.find_first_acceptable
— Methodfindfirstacceptable(M, W)
Finds first combination of Miter
and Niter
that doesn't make M
too small while producing Miter * Niter = num_cores()
. This would be awkard if there are computers with prime numbers of cores. I should probably consider that possibility at some point.
Octavian.matmul_sizes
— MethodChecks sizes for compatibility, and preserves the static size information if given a mix of static and dynamic sizes.
Octavian.matmul_st_only_pack_A!
— MethodOnly packs A
. Primitively does column-major packing: it packs blocks of A
into a column-major temporary.
Octavian.reset_bcache_lock!
— Methodresetbcachelock!()
Currently not using try/finally in matmul routine, despite locking. So if it errors for some reason, you may need to manually call reset_bcache_lock!()
.
Octavian.solve_block_sizes
— Methodsolveblocksizes(::Val{T}, M, K, N, α, β, R₂, R₃)
This function returns iteration/blocking descriptions Mc
, Kc
, and Nc
for use when packing both A
and B
. It tries to roughly minimize the cost
MKN / (Kc * W) + α * MKN / Mc + β * MKN / Nc
subject to constraints
Mc - M ≤ 0
Kc - K ≤ 0
Nc - N ≤ 0
Mc * Kc - L₁ₑ ≤ 0
Kc * Nc - L₂ₑ ≤ 0
That is, our constraints say that our block sizes shouldn't be bigger than the actual dimensions, and also that our packed A
(Mc × Kc
) should fit into the first packing cache (generally, actually the L₂
, and our packed B
(Kc × Nc
) should fit into the second packing cache (generally the L₃
).
Our cost model consists of three components:
- Cost of moving data in and out of registers. This is done
(M/Mᵣ * K/Kc * N/Nᵣ)
times and the cost per is(Mᵣ/W * Nᵣ)
. - Cost of moving strips from
B
pack from the low cache levels to the highest cache levels when multiplyingAₚ * Bₚ
. This is done(M / Mc * K / Kc * N / Nc)
times, and the cost per is proportional to(Kc * Nᵣ)
.α
is the proportionality-constant parameter. - Cost of packing
A
. This is done(M / Mc * K / Kc * N / Nc)
times, and the cost per is proportional to(Mc * Kc)
.β
is the proportionality-constant parameter.
As W
is a constant, we multiply the cost by W
and absorb it into α
and β
. We drop it from the description from here on out.
In the full problem, we would have Lagrangian, with μ < 0: f((Mc,Kc,Nc),(μ₁,μ₂,μ₃,μ₄,μ₅)) MKN/Kc + α * MKN/Mc + β * MKN/Nc - μ₁(Mc - M) - μ₂(Kc - K) - μ₃(Nc - N) - μ₄(McKc - L2) - μ₅(KcNc - L3)
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₁ - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₂ - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₃ - μ₅ * Kc
0 = ∂L / ∂μ₁ = M - Mc
0 = ∂L / ∂μ₂ = K - Kc
0 = ∂L / ∂μ₃ = N - Nc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * Nc
The first 3 constraints complicate things, because they're trivially solved by setting M = Mc
, K = Kc
, and N = Nc
. But this will violate the last two constraints in general; normally we will be on the interior of the inequalities, meaning we'd be dropping those constraints. Doing so, this leaves us with:
First, lets just solve the cost w/o constraints 1-3
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₅ * Kc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * Nc
Solving:
Mc = √(L₁ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₂ₑ)
Kc = √(L₁ₑ) * √(L₂ₑ) / √(L₁ₑ * β + L₂ₑ * α)
Nc = √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ)
μ₄ = -K * √(L₂ₑ) * M * N * α / (L₁ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))
μ₅ = -K * √(L₁ₑ) * M * N * β / (L₂ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))
These solutions are indepedent of matrix size. The approach we'll take here is solving for Nc
, Kc
, and then finally Mc
one after the other, incorporating sizes.
Starting with N
, we check how many iterations would be implied by Nc
, and then choose the smallest value that would yield that number of iterations. This also ensures that Nc ≤ N
.
Niter = cld(N, √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ))
Nblock, Nrem = divrem(N, Niter)
Nblock_Nrem = Nblock + (Nrem > 0)
We have Nrem
iterations of size Nblock_Nrem
, and Niter - Nrem
iterations of size Nblock
.
We can now make Nc = Nblock_Nrem
a constant, and solve the remaining three equations again:
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Ncm
0 = ∂L / ∂μ₄ = L₂ₑ - Mc * Kc
yielding
Mc = √(L₁ₑ) * √(α)
Kc = √(L₁ₑ) / √(α)
μ₄ = -K * M * N * √(α) / L₁ₑ^(3 / 2)
We proceed in the same fashion as for Nc
, being sure to reapply the Kc * Nc ≤ L₂ₑ
constraint:
Kiter = cld(K, min(√(L₁ₑ) / √(α), L₂ₑ / Nc))
Kblock, Krem = divrem(K, Ki)
Kblock_Krem = Kblock + (Krem > 0)
This leaves Mc
partitioning, for which, for which we use the constraint Mc * Kc ≤ L₁ₑ
to set the initial number of proposed iterations as cld(M, L₁ₑ / Kcm)
for calling split_m
. # approximate ceil
Mbsize, Mrem, Mremfinal, Mblocks = split_m(M, cld(M, L₁ₑ / Kcm), StaticInt{W}())
Note that for synchronization on B
, all threads must have the same values for Kc
and Nc
. K
and N
will be equal between threads, but M
may differ. By calculating Kc
and Nc
independently of M
, this algorithm guarantees all threads are on the same page.
Octavian.split_m
— Methodsplit_m(M, Miters_base, W)
Splits M
into at most Miters_base
iterations. For example, if we wish to divide 517
iterations into roughly 7 blocks using multiples of 8
:
julia> split_m(517, 7, 8)
(72, 2, 69, 7)
This suggests we have base block sizes of size 72
, with two iterations requiring an extra remainder of 8 ( = W)
, and a final block of 69
to handle the remainder. It also tells us that there are 7
total iterations, as requested.
julia> 80 * 2 + 72 * (7 - 2 - 1) + 69
517
This is meant to specify roughly the requested amount of blocks, and return relatively even sizes.
This method is used fairly generally.