Internals (Private)
Octavian._matmul_serial!Octavian.divide_blocksOctavian.find_first_acceptableOctavian.matmul_sizesOctavian.matmul_st_only_pack_A!Octavian.reset_bcache_lock!Octavian.solve_block_sizesOctavian.split_m
Octavian._matmul_serial! — Methodmatmul_serial!(C, A, B[, α = 1, β = 0])
Calculates C = α * (A * B) + β * C in place.
A single threaded matrix-matrix-multiply implementation. Supports dynamically and statically sized arrays.
Organizationally, matmul_serial! checks the arrays properties to try and dispatch to an appropriate implementation. If the arrays are small and statically sized, it will dispatch to an inlined multiply.
Otherwise, based on the array's size, whether they are transposed, and whether the columns are already aligned, it decides to not pack at all, to pack only A, or to pack both arrays A and B.
Octavian.divide_blocks — Methoddivide_blocks(M, Ntotal, _nspawn, W)
Splits both M and N into blocks when trying to spawn a large number of threads relative to the size of the matrices.
Octavian.find_first_acceptable — Methodfindfirstacceptable(M, W)
Finds first combination of Miter and Niter that doesn't make M too small while producing Miter * Niter = num_cores(). This would be awkard if there are computers with prime numbers of cores. I should probably consider that possibility at some point.
Octavian.matmul_sizes — MethodChecks sizes for compatibility, and preserves the static size information if given a mix of static and dynamic sizes.
Octavian.matmul_st_only_pack_A! — MethodOnly packs A. Primitively does column-major packing: it packs blocks of A into a column-major temporary.
Octavian.reset_bcache_lock! — Methodresetbcachelock!()
Currently not using try/finally in matmul routine, despite locking. So if it errors for some reason, you may need to manually call reset_bcache_lock!().
Octavian.solve_block_sizes — Methodsolveblocksizes(::Val{T}, M, K, N, α, β, R₂, R₃)
This function returns iteration/blocking descriptions Mc, Kc, and Nc for use when packing both A and B. It tries to roughly minimize the cost
MKN / (Kc * W) + α * MKN / Mc + β * MKN / Ncsubject to constraints
Mc - M ≤ 0
Kc - K ≤ 0
Nc - N ≤ 0
Mc * Kc - L₁ₑ ≤ 0
Kc * Nc - L₂ₑ ≤ 0That is, our constraints say that our block sizes shouldn't be bigger than the actual dimensions, and also that our packed A (Mc × Kc) should fit into the first packing cache (generally, actually the L₂, and our packed B (Kc × Nc) should fit into the second packing cache (generally the L₃).
Our cost model consists of three components:
- Cost of moving data in and out of registers. This is done
(M/Mᵣ * K/Kc * N/Nᵣ)times and the cost per is(Mᵣ/W * Nᵣ). - Cost of moving strips from
Bpack from the low cache levels to the highest cache levels when multiplyingAₚ * Bₚ. This is done(M / Mc * K / Kc * N / Nc)times, and the cost per is proportional to(Kc * Nᵣ).αis the proportionality-constant parameter. - Cost of packing
A. This is done(M / Mc * K / Kc * N / Nc)times, and the cost per is proportional to(Mc * Kc).βis the proportionality-constant parameter.
As W is a constant, we multiply the cost by W and absorb it into α and β. We drop it from the description from here on out.
In the full problem, we would have Lagrangian, with μ < 0: f((Mc,Kc,Nc),(μ₁,μ₂,μ₃,μ₄,μ₅)) MKN/Kc + α * MKN/Mc + β * MKN/Nc - μ₁(Mc - M) - μ₂(Kc - K) - μ₃(Nc - N) - μ₄(McKc - L2) - μ₅(KcNc - L3)
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₁ - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₂ - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₃ - μ₅ * Kc
0 = ∂L / ∂μ₁ = M - Mc
0 = ∂L / ∂μ₂ = K - Kc
0 = ∂L / ∂μ₃ = N - Nc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * NcThe first 3 constraints complicate things, because they're trivially solved by setting M = Mc, K = Kc, and N = Nc. But this will violate the last two constraints in general; normally we will be on the interior of the inequalities, meaning we'd be dropping those constraints. Doing so, this leaves us with:
First, lets just solve the cost w/o constraints 1-3
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₅ * Kc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * NcSolving:
Mc = √(L₁ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₂ₑ)
Kc = √(L₁ₑ) * √(L₂ₑ) / √(L₁ₑ * β + L₂ₑ * α)
Nc = √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ)
μ₄ = -K * √(L₂ₑ) * M * N * α / (L₁ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))
μ₅ = -K * √(L₁ₑ) * M * N * β / (L₂ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))These solutions are indepedent of matrix size. The approach we'll take here is solving for Nc, Kc, and then finally Mc one after the other, incorporating sizes.
Starting with N, we check how many iterations would be implied by Nc, and then choose the smallest value that would yield that number of iterations. This also ensures that Nc ≤ N.
Niter = cld(N, √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ))
Nblock, Nrem = divrem(N, Niter)
Nblock_Nrem = Nblock + (Nrem > 0)We have Nrem iterations of size Nblock_Nrem, and Niter - Nrem iterations of size Nblock.
We can now make Nc = Nblock_Nrem a constant, and solve the remaining three equations again:
0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Ncm
0 = ∂L / ∂μ₄ = L₂ₑ - Mc * Kcyielding
Mc = √(L₁ₑ) * √(α)
Kc = √(L₁ₑ) / √(α)
μ₄ = -K * M * N * √(α) / L₁ₑ^(3 / 2)We proceed in the same fashion as for Nc, being sure to reapply the Kc * Nc ≤ L₂ₑ constraint:
Kiter = cld(K, min(√(L₁ₑ) / √(α), L₂ₑ / Nc))
Kblock, Krem = divrem(K, Ki)
Kblock_Krem = Kblock + (Krem > 0)This leaves Mc partitioning, for which, for which we use the constraint Mc * Kc ≤ L₁ₑ to set the initial number of proposed iterations as cld(M, L₁ₑ / Kcm) for calling split_m. # approximate ceil
Mbsize, Mrem, Mremfinal, Mblocks = split_m(M, cld(M, L₁ₑ / Kcm), StaticInt{W}())Note that for synchronization on B, all threads must have the same values for Kc and Nc. K and N will be equal between threads, but M may differ. By calculating Kc and Nc independently of M, this algorithm guarantees all threads are on the same page.
Octavian.split_m — Methodsplit_m(M, Miters_base, W)Splits M into at most Miters_base iterations. For example, if we wish to divide 517 iterations into roughly 7 blocks using multiples of 8:
julia> split_m(517, 7, 8)
(72, 2, 69, 7)This suggests we have base block sizes of size 72, with two iterations requiring an extra remainder of 8 ( = W), and a final block of 69 to handle the remainder. It also tells us that there are 7 total iterations, as requested.
julia> 80 * 2 + 72 * (7 - 2 - 1) + 69
517This is meant to specify roughly the requested amount of blocks, and return relatively even sizes.
This method is used fairly generally.