[Research Publication — April 2026]

BHDR: A 43× Rotation Reduction for Encrypted Kernel SHAP via Offline-Precomputed Regression Matrix

BHDR = BSGS-Hoisted Diagonal Regression  ·  BSGS = Baby-Step Giant-Step

Kernel SHAP under CKKS fully homomorphic encryption, end-to-end, non-interactive, single-server. The server never sees the client's input, the prediction, or the explanation in plaintext. The closing weighted-least-squares regression — previously 84% of the pipeline at 7.7s — is reduced to 0.6s by shifting the regression matrix M = (ZᵀWZ)⁻¹ ZᵀW offline (it depends only on public sampling parameters) and evaluating M · enc(y) via a BSGS-hoisted diagonal matvec on a K′-periodic replicate encoding. Rotations drop from ~2,200 to 51 at d = 50 — a 43× reduction and a 4.4× speedup on the full pipeline.

// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPAD flooding).
// Status: pre-print. Systems companion to the efficiency-axiom numerical-correction paper. Full paper available on request; ePrint posting upon submission. Reference implementation and exp11 harness available under NDA; artifact-evaluation release in preparation.

Author Bader Alissaei
Date April 2026
Affiliation VaultBytes Engineering Ltd
Topic FHE · CKKS · Kernel SHAP · Explainability · Privacy-Preserving ML
Keywords BSGS · Diagonal Matvec · SIMD Packing · Matrix Bernstein · OCTE

What We Built

Two regulatory regimes — protect personal data, and explain consequential decisions — are on a collision course. BHDR resolves it.

01

GDPR protects the data. The EU AI Act demands explanations.

GDPR, HIPAA, CCPA require personal data to be protected during processing. The EU AI Act (Article 13 transparency; Article 86 right to explanation) requires high-risk systems to provide feature-level explanations for consequential decisions — effective August 2026 for high-risk deployments. Plaintext SHAP on personal data violates the first. Standard FHE inference satisfies the first but can't produce attributions. Today no off-the-shelf pipeline does both.

02

Kernel SHAP under FHE had not been built, and is expensive

We are not aware of any prior single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input (we surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, ICML through early 2026). XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost: a natural port of Kernel SHAP's closing weighted-least-squares regression consumes ~2,200 ciphertext rotations per query at d = 50 on the production binding path, since every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the entire pipeline.

03

The key insight: the regression matrix is public

Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW and Z, W are the public coalition sampling matrix and kernel weights. M depends on nothing secret — so we compute it once, offline, in plaintext. At encryption time only the encrypted coalition outputs y remain on the FHE side, turning an expensive encrypted linear system into a plaintext-matrix × encrypted-vector product. Prior MPC-based Shapley constructions (HESV, VLDB 2023) have separately exploited the public structure of SHAP regression weights; our contribution is the combination with BSGS-hoisted diagonal matvec and K′-periodic replicate encoding, which together unlock single-server FHE deployment.
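As a sanity check on the offline step, here is a plaintext sketch in NumPy. The sizes and helper names are illustrative (not the paper's d = 50, K = 390 implementation); it builds M from a public coalition matrix and weights, then confirms M · y matches the weighted least-squares solution:

```python
import numpy as np

def kernel_shap_matrix(Z, w):
    # Plaintext regression matrix M = (Z^T W Z)^{-1} Z^T W  (d x K).
    # Z and w come from public sampling parameters, so this runs offline.
    W = np.diag(w)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W)

# Toy instance (illustrative sizes only).
rng = np.random.default_rng(0)
d, K = 5, 40
# Identity rows guarantee Z has full column rank in this toy setup.
Z = np.vstack([np.eye(d), rng.integers(0, 2, size=(K - d, d))])
w = rng.random(K) + 0.1            # stand-in for the Lundberg-Lee weights
M = kernel_shap_matrix(Z, w)       # computed once, in the clear

# At query time the server only needs M @ y on the encrypted y.
y = rng.standard_normal(K)
phi = M @ y

# Same phi as solving the weighted least squares directly:
ref = np.linalg.lstsq(np.sqrt(w)[:, None] * Z, np.sqrt(w) * y, rcond=None)[0]
assert np.allclose(phi, ref)
```

Because M depends on (d, K, seed) alone, it can be cached per configuration and reused across every query.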

04

BHDR: hoist the diagonals, then BSGS

We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step (BSGS) split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting — one shared KeySwitchDown for the whole group. The total rotation budget collapses to 51: 31 baby-step rotations, 15 giant-step rotations, and 5 replication doublings. That is a 43× reduction against the production baseline and 381× against the naïve d(K−1) bound.
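The BSGS diagonal split can be emulated in plaintext to see where the rotation sharing comes from. In the sketch below, np.roll stands in for EvalRotate and a square n = r₁·r₂ matrix is used for simplicity (the paper's rectangular d × K′ matrix additionally relies on the K′-periodic replicate encoding); the identity is checked against a direct matvec:

```python
import numpy as np

def diag_extract(M, k):
    # Halevi-Shoup diagonal k of a square matrix: d_k[i] = M[i, (i + k) % n]
    n = M.shape[0]
    return np.array([M[i, (i + k) % n] for i in range(n)])

def rot(v, k):
    # Slot rotation by k (plaintext stand-in for EvalRotate).
    return np.roll(v, -k)

def bsgs_matvec(M, v, r1, r2):
    # Decompose rotation index k = j*r1 + i. The r1 baby-step rotations of v
    # are computed once and shared by every giant step -- this sharing is
    # exactly what hoisting amortises under CKKS.
    n = M.shape[0]
    assert n == r1 * r2
    baby = [rot(v, i) for i in range(r1)]          # shared across giant steps
    out = np.zeros(n)
    for j in range(r2):
        # Pre-rotated diagonals p'[j,i] = rot(diag_{j*r1+i}, -j*r1),
        # done offline in the real pipeline.
        acc = sum(rot(diag_extract(M, j * r1 + i), -j * r1) * baby[i]
                  for i in range(r1))
        out = out + rot(acc, j * r1)               # one giant-step rotation
    return out

rng = np.random.default_rng(1)
n, r1, r2 = 16, 4, 4
M = rng.standard_normal((n, n))
v = rng.standard_normal(n)
assert np.allclose(bsgs_matvec(M, v, r1, r2), M @ v)
```

The rotation count is (r₁ − 1) + (r₂ − 1) rather than n − 1, which is the source of the collapse from ~2,200 to 51.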

// The core insight

The expensive part of Kernel SHAP — computing (ZᵀWZ)⁻¹ ZᵀW — uses only the coalition sampling matrix and the kernel weights. Both are public. Neither depends on the input, the model output, or the user's identity. So we do the whole matrix inversion once, in plaintext, before any ciphertext exists. At query time the server only needs a single matrix-vector product: a known plaintext matrix times an encrypted vector. We then evaluate that product with BSGS-hoisted diagonals, where d output coordinates share one set of rotations instead of each paying for its own. Two independent ideas — public-parameter offloading and structured matvec — compose into a pipeline speedup that makes the entire construction deployable.

// BHDR pipeline — what happens where
Offline (plaintext, one-shot)
  Public params d, K, seed → sample Z, W (coalitions + weights) → M = (ZᵀWZ)⁻¹ ZᵀW (plaintext pre-compute) → diagonals of M (K′-periodic packing)

Client (holds secret key)
  Input x clipped to [−R, R]ᵈ → CKKS encrypt → enc(x), enc(baseline) → send to server (ciphertext only)

Server (honest-but-curious, sees ciphertext only)
  Coalition mask Z · enc(x) → model eval f(enc(·)) per coalition → BHDR matvec (51 rotations total) → enc(φ), the SHAP ciphertext

Client (holds secret key)
  Decrypt enc(φ) → recover SHAP values → display per-feature attribution
// Rotation budget for the closing regression (d = 50, K = 390, K′ = 512)
Naïve bound      d × (K−1), one row per output          19,450 rotations   unshipped worst case
v1 production    row-wise BSGS, per-output batching     ~2,200 rotations   7.70 s on M1 ⚠
BHDR             hoisted diagonals, r₁ = 32, r₂ = 16    51 rotations       0.60 s on M1 ✓
                                                        = 43× reduction vs. production
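The budget above is simple arithmetic; a few lines reproduce the headline counts (the ~2,200 v1 figure is the paper's measured value, taken as given here):

```python
import math

d, K, Kp, r1, r2, n = 50, 390, 512, 32, 16, 16384

naive = d * (K - 1)                        # one row-dot per output feature
replicate = math.ceil(math.log2(n // Kp))  # rotate-and-double tiling steps
bhdr = replicate + (r1 - 1) + (r2 - 1)     # 5 + 31 + 15

print(naive, bhdr)                         # 19450 51
print(round(naive / bhdr))                 # 381  (vs. the naive bound)
print(round(2200 / bhdr))                  # 43   (vs. v1 production)
```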

The Numbers

43× · rotation reduction vs. v1 production BSGS
51 · total rotations for the full d = 50 regression
2.1 s · algorithm-level pipeline latency (Apple M1)
12.7 s · end-to-end p50 (N = 100, UCI Adult, Apple M1)
0 / 100 · silent CKKS overflows post input-guard fix
// Pipeline latency breakdown — Apple M1, d = 50, K = 390 (lower is better)
                Coalition masking (200)   Packed model eval (300)   Regression (400)   Total
v1 (naïve)      0.92 s                    0.54 s                    7.70 s             9.2 s
v2 (BHDR)       0.92 s                    0.54 s                    0.60 s             2.1 s

BHDR collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged; on server-class hardware the bottleneck shifts to the model-evaluation stage, which becomes the next optimisation target. Overall pipeline speedup: 4.4×.

// End-to-end encrypted pipeline (d = 50, K = 390, K′ = 512, 128-bit CKKS)
Component                   Apple M1 · v1   Apple M1 · v2 (BHDR)   Hetzner CX22 · v1   Hetzner CX22 · v2 (BHDR)
Coalition masking (200)     0.92 s          0.92 s                 ~5 s                ~5 s
Packed model eval (300)     0.54 s          0.54 s                 ~30 s               ~30 s
Regression (400)            7.70 s          0.60 s                 ~60 s               ~10–15 s†
Total pipeline              9.2 s           2.1 s                  ~95 s               ~45–50 s†

† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4-vCPU AMD EPYC research server, scaled by the empirical ~2× CX22/EPYC ratio. M1 totals omit per-query key regeneration and encrypted-baseline round-trip. The wider d = 50 exp11 run including those overheads over N = 100 UCI Adult queries reports p50 = 12.7 s, p95 = 28.0 s, p99 = 49.0 s, mean = 15.6 s (±7.2 s) on the same host — with the >10 s p50/p95 spread reflecting laptop-class thermal cycling. Warm-cache deployments that reuse the encrypted baseline across queries recover the 2.1 s headline figure.

// OCTE tree-ensemble feasibility study (Hetzner CX22, measured end-to-end)
OCTE · T = 100, D = 4    ~53 s · passes accuracy gate                                  FEASIBLE
OCTE · T = 100, D = 6    fails accuracy gate · noise budget exhausted at N = 2¹⁶       UNREACHABLE

The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths simultaneously using a tanh–Chebyshev sign surrogate at ~5 CKKS levels, path indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys; D = 6 fails the accuracy gate due to noise-budget exhaustion at N = 2¹⁶. A composite minimax sign gate and GPU rotation acceleration (CAT framework reports 68–74×) remain in-tree experimental paths.
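To make the OCTE idea concrete, here is a plaintext sketch of oblivious evaluation of a complete depth-D tree: all 2ᴰ paths are weighted by a soft indicator and summed, so the data-access pattern is independent of the input. The tanh-based soft_step is a stand-in for the paper's tanh–Chebyshev sign surrogate (under CKKS it would be a low-degree polynomial); all names and sizes here are illustrative, not the OCTE implementation:

```python
import numpy as np

def soft_step(t, alpha=8.0):
    # Smooth surrogate for 1[t >= 0]; a stand-in for the CKKS sign gate.
    return 0.5 * (1.0 + np.tanh(alpha * t))

def oblivious_tree_eval(x, feat, thr, leaves, D):
    # Complete depth-D tree in level order: root 0, children 2i+1 / 2i+2.
    # Every leaf's path indicator is a product of soft comparisons, so no
    # branch depends on x -- the 'oblivious' part.
    out = 0.0
    for leaf in range(2 ** D):
        ind, node = 1.0, 0
        for level in range(D):
            go_right = (leaf >> (D - 1 - level)) & 1   # MSB = first split
            s = soft_step(x[feat[node]] - thr[node])
            ind *= s if go_right else (1.0 - s)
            node = 2 * node + 1 + go_right
        out += ind * leaves[leaf]
    return out

# Tiny depth-2 tree: 3 internal nodes, 4 leaves.
x = np.array([0.9, -0.3])
feat = [0, 1, 1]
thr = [0.5, 0.0, 0.0]
leaves = [1.0, 2.0, 3.0, 4.0]
# Hard path: x[0] >= 0.5 -> right, x[1] < 0.0 -> left -> leaf value 3.0.
val = oblivious_tree_eval(x, feat, thr, leaves, D=2)   # close to 3.0
```

The CKKS cost structure follows directly: one sign evaluation per internal node, ⌈log₂ D⌉ levels of indicator products, and plaintext leaf aggregation, which is why depth drives the noise budget.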

The core idea in pseudocode — offline M, BSGS-hoisted matvec (matches Algorithm 1 in the paper)

Once. Up front. Plaintext. Then at query time the server does 51 rotations on a single encrypted vector to produce the full d-dimensional SHAP attribution. Rotation budget breakdown: 5 replication doublings + 31 baby-step + 15 giant-step = 51.

# === Offline, plaintext, run once per (d, K, seed) ===
def precompute_diagonals(d, K, seed, Kp=512, r1=32, r2=16):
    Z = sample_coalitions(d, K, seed) # {0,1}^{K×d}
    W = kernel_weights(Z) # Lundberg–Lee
    M = inv(Z.T @ W @ Z) @ Z.T @ W # d × K, plaintext
    M_pad = pad_columns(M, target=Kp) # K → K′ = 512
    diags = halevi_shoup_diagonals(M_pad) # K′ diagonals
    return pre_rotate_giantstep(diags, r1, r2) # p'[j,i], tiled K′-periodic

# === Online, per query, on encrypted enc(y) — Algorithm 1 ===
def bhdr_regression(enc_y, pdiags, r1=32, r2=16, Kp=512, n=16384):
    # Step 1: K′-periodic replication — tile enc(y) with period K′
    # via ⌈log₂(n/K′)⌉ = 5 rotate-and-double operations.
    enc_y_rep = enc_y
    for s in range(5): # offsets K′, 2K′, 4K′, 8K′, 16K′
        enc_y_rep = enc_y_rep + EvalRotate(enc_y_rep, Kp << s)

    # Step 2: baby-step precompute (level-1 hoisting — shared digits).
    digits = EvalFastRotationPrecompute(enc_y_rep)

    # Step 3: baby-step rotations, kept in the extended CRT basis
    # (level-2 hoisting); i = 0 is the identity, so 31 real rotations.
    ct_ext = [EvalFastRotationExt(enc_y_rep, i, digits)
              for i in range(r1)] # i = 0, ..., r1−1

    # Step 4: giant-step accumulation in extended basis,
    # then KeySwitchDown per giant step.
    acc = [None] * r2
    for j in range(r2):
        acc_j = sum(pdiags[j][i] * ct_ext[i] for i in range(r1))
        acc[j] = KeySwitchDown(acc_j)

    # Step 5: giant-step rotation and sum; j = 0 needs no rotation,
    # so 15 real EvalRotate calls.
    result = sum(EvalRotate(acc[j], j * r1) for j in range(r2))

    # Step 6: rescale — 1 multiplicative depth level consumed.
    return Rescale(result) # enc(φ), 5 + 31 + 15 = 51 rotations total
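Step 1's rotate-and-double tiling can be checked in plaintext (np.roll standing in for EvalRotate; toy slot counts rather than the paper's n = 16384, K′ = 512):

```python
import numpy as np

# Tile a K'-slot vector across n slots with log2(n/K') rotate-and-double
# passes, mirroring Step 1 of the online algorithm.
n, Kp = 64, 8
y = np.zeros(n)
y[:Kp] = np.arange(1, Kp + 1)      # enc(y) occupies the first K' slots

rep = y.copy()
for s in range(int(np.log2(n // Kp))):   # offsets K', 2K', 4K', ...
    rep = rep + np.roll(rep, Kp << s)    # rotate-and-add doubles the copies

# The result is K'-periodic: every block of K' slots holds the same data,
# which is what lets d output coordinates share one set of rotations.
assert np.allclose(rep, np.tile(y[:Kp], n // Kp))
```

Each pass doubles the number of filled blocks, so ⌈log₂(n/K′)⌉ passes suffice: 5 at the paper's parameters, matching the replication term in the 51-rotation budget.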

Cite This Work

// BibTeX
@misc{alissaei2026bhdr,
  author       = {Bader Alissaei},
  title        = {{BHDR}: {BSGS}-Hoisted Diagonal Regression for
                   Non-Interactive Single-Server Kernel {SHAP} under {CKKS}},
  year         = {2026},
  note         = {Available on request from b@vaultbytes.com},
}