[Research Publication — April 2026]

BHDR: A 43× Rotation Reduction for Encrypted Kernel SHAP via Offline-Precomputed Regression Matrix

BHDR = BSGS-Hoisted Diagonal Regression  ·  BSGS = Baby-Step Giant-Step

Kernel SHAP under CKKS fully homomorphic encryption, end-to-end, non-interactive, single-server. The server never sees the client's input, the prediction, or the explanation in plaintext. The closing weighted-least-squares regression — previously 84% of the pipeline at 7.7s — is reduced to 0.6s by shifting the regression matrix M = (ZᵀWZ)⁻¹ ZᵀW offline (it depends only on public sampling parameters) and evaluating M · enc(y) via a BSGS-hoisted diagonal matvec on a K′-periodic replicate encoding. Rotations drop from ~2,200 to 51 at d = 50 — a 43× reduction and a 4.4× speedup on the full pipeline.

// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPAD flooding).
// Status: pre-print. Systems companion to the efficiency-axiom numerical-correction paper. Full paper available on request; ePrint posting upon submission. Reference implementation and exp11 harness available under NDA; artifact-evaluation release in preparation.

Author Bader Alissaei
Date April 2026
Affiliation VaultBytes Engineering Ltd
Topic FHE · CKKS · Kernel SHAP · Explainability · Privacy-Preserving ML
Keywords BSGS · Diagonal Matvec · SIMD Packing · Matrix Bernstein · OCTE

What We Built

Two regulatory regimes — protect personal data, and explain consequential decisions — are on a collision course. BHDR resolves it.

01

GDPR protects the data. The EU AI Act demands explanations.

GDPR, HIPAA, CCPA require personal data to be protected during processing. The EU AI Act (Article 13 transparency; Article 86 right to explanation) requires high-risk systems to provide feature-level explanations for consequential decisions — effective August 2026 for high-risk deployments. Plaintext SHAP on personal data violates the first. Standard FHE inference satisfies the first but can't produce attributions. Today no off-the-shelf pipeline does both.

02

Kernel SHAP under FHE had not been built, and is expensive

We are not aware of any prior single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input (we surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, ICML through early 2026). XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost: a natural port of Kernel SHAP's closing weighted-least-squares regression consumes ~2,200 ciphertext rotations per query at d = 50 on the production binding path, since every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the entire pipeline.

03

The key insight: the regression matrix is public

Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW and Z, W are the public coalition sampling matrix and kernel weights. M depends on nothing secret — so we compute it once, offline, in plaintext. At encryption time only the encrypted coalition outputs y remain on the FHE side, turning an expensive encrypted linear system into a plaintext-matrix × encrypted-vector product. Prior MPC-based Shapley constructions (HESV, VLDB 2023) have separately exploited the public structure of SHAP regression weights; our contribution is the combination with BSGS-hoisted diagonal matvec and K′-periodic replicate encoding, which together unlock single-server FHE deployment.
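As a sanity check on the offline step, here is a plaintext sketch in NumPy. The sizes and helper names are illustrative (not the paper's d = 50, K = 390 implementation); it builds M from a public coalition matrix and weights, then confirms M · y matches the weighted least-squares solution:

```python
import numpy as np

def kernel_shap_matrix(Z, w):
    # Plaintext regression matrix M = (Z^T W Z)^{-1} Z^T W  (d x K).
    # Z and w come from public sampling parameters, so this runs offline.
    W = np.diag(w)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W)

# Toy instance (illustrative sizes only).
rng = np.random.default_rng(0)
d, K = 5, 40
# Identity rows guarantee Z has full column rank in this toy setup.
Z = np.vstack([np.eye(d), rng.integers(0, 2, size=(K - d, d))])
w = rng.random(K) + 0.1            # stand-in for the Lundberg-Lee weights
M = kernel_shap_matrix(Z, w)       # computed once, in the clear

# At query time the server only needs M @ y on the encrypted y.
y = rng.standard_normal(K)
phi = M @ y

# Same phi as solving the weighted least squares directly:
ref = np.linalg.lstsq(np.sqrt(w)[:, None] * Z, np.sqrt(w) * y, rcond=None)[0]
assert np.allclose(phi, ref)
```

Because M depends on (d, K, seed) alone, it can be cached per configuration and reused across every query.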

04

BHDR: hoist the diagonals, then BSGS

We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step (BSGS) split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting — one shared KeySwitchDown for the whole group. The total rotation budget collapses to 51: 31 baby-step rotations, 15 giant-step rotations, and 5 replication doublings. That is a 43× reduction against the production baseline and 381× against the naïve d(K−1) bound.
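The BSGS diagonal split can be emulated in plaintext to see where the rotation sharing comes from. In the sketch below, np.roll stands in for EvalRotate and a square n = r₁·r₂ matrix is used for simplicity (the paper's rectangular d × K′ matrix additionally relies on the K′-periodic replicate encoding); the identity is checked against a direct matvec:

```python
import numpy as np

def diag_extract(M, k):
    # Halevi-Shoup diagonal k of a square matrix: d_k[i] = M[i, (i + k) % n]
    n = M.shape[0]
    return np.array([M[i, (i + k) % n] for i in range(n)])

def rot(v, k):
    # Slot rotation by k (plaintext stand-in for EvalRotate).
    return np.roll(v, -k)

def bsgs_matvec(M, v, r1, r2):
    # Decompose rotation index k = j*r1 + i. The r1 baby-step rotations of v
    # are computed once and shared by every giant step -- this sharing is
    # exactly what hoisting amortises under CKKS.
    n = M.shape[0]
    assert n == r1 * r2
    baby = [rot(v, i) for i in range(r1)]          # shared across giant steps
    out = np.zeros(n)
    for j in range(r2):
        # Pre-rotated diagonals p'[j,i] = rot(diag_{j*r1+i}, -j*r1),
        # done offline in the real pipeline.
        acc = sum(rot(diag_extract(M, j * r1 + i), -j * r1) * baby[i]
                  for i in range(r1))
        out = out + rot(acc, j * r1)               # one giant-step rotation
    return out

rng = np.random.default_rng(1)
n, r1, r2 = 16, 4, 4
M = rng.standard_normal((n, n))
v = rng.standard_normal(n)
assert np.allclose(bsgs_matvec(M, v, r1, r2), M @ v)
```

The rotation count is (r₁ − 1) + (r₂ − 1) rather than n − 1, which is the source of the collapse from ~2,200 to 51.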

// The core insight

The expensive part of Kernel SHAP — computing (ZᵀWZ)⁻¹ ZᵀW — uses only the coalition sampling matrix and the kernel weights. Both are public. Neither depends on the input, the model output, or the user's identity. So we do the whole matrix inversion once, in plaintext, before any ciphertext exists. At query time the server only needs a single matrix-vector product: a known plaintext matrix times an encrypted vector. We then evaluate that product with BSGS-hoisted diagonals, where d output coordinates share one set of rotations instead of each paying for its own. Two independent ideas — public-parameter offloading and structured matvec — compose into a pipeline speedup that makes the entire construction deployable.

// BHDR pipeline — what happens where
Offline (plaintext, one-shot)
  Public params d, K, seed → sample Z, W (coalitions + weights) → M = (ZᵀWZ)⁻¹ ZᵀW (plaintext pre-compute) → diagonals of M (K′-periodic packing)

Client (holds secret key)
  Input x clipped to [−R, R]ᵈ → CKKS encrypt → enc(x), enc(baseline) → send to server (ciphertext only)

Server (honest-but-curious, sees ciphertext only)
  Coalition mask Z · enc(x) → model eval f(enc(·)) per coalition → BHDR matvec (51 rotations total) → enc(φ), the SHAP ciphertext

Client (holds secret key)
  Decrypt enc(φ) → recover SHAP values → display per-feature attribution
// Rotation budget for the closing regression (d = 50, K = 390, K′ = 512)
Naïve bound      d × (K−1), one row per output          19,450 rotations   unshipped worst case
v1 production    row-wise BSGS, per-output batching     ~2,200 rotations   7.70 s on M1 ⚠
BHDR             hoisted diagonals, r₁ = 32, r₂ = 16    51 rotations       0.60 s on M1 ✓
                                                        = 43× reduction vs. production
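The budget above is simple arithmetic; a few lines reproduce the headline counts (the ~2,200 v1 figure is the paper's measured value, taken as given here):

```python
import math

d, K, Kp, r1, r2, n = 50, 390, 512, 32, 16, 16384

naive = d * (K - 1)                        # one row-dot per output feature
replicate = math.ceil(math.log2(n // Kp))  # rotate-and-double tiling steps
bhdr = replicate + (r1 - 1) + (r2 - 1)     # 5 + 31 + 15

print(naive, bhdr)                         # 19450 51
print(round(naive / bhdr))                 # 381  (vs. the naive bound)
print(round(2200 / bhdr))                  # 43   (vs. v1 production)
```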

The Numbers

43× · rotation reduction vs. v1 production BSGS
51 · total rotations for the full d = 50 regression
2.1 s · algorithm-level pipeline latency (Apple M1)
12.7 s · end-to-end p50 (N = 100, UCI Adult, Apple M1)
0 / 100 · silent CKKS overflows post input-guard fix
// Pipeline latency breakdown — Apple M1, d = 50, K = 390 (lower is better)
                Coalition masking (200)   Packed model eval (300)   Regression (400)   Total
v1 (naïve)      0.92 s                    0.54 s                    7.70 s             9.2 s
v2 (BHDR)       0.92 s                    0.54 s                    0.60 s             2.1 s

BHDR collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged; on server-class hardware the bottleneck shifts to the model-evaluation stage, which becomes the next optimisation target. Overall pipeline speedup: 4.4×.

// End-to-end encrypted pipeline (d = 50, K = 390, K′ = 512, 128-bit CKKS)
Component                   Apple M1 · v1   Apple M1 · v2 (BHDR)   Hetzner CX22 · v1   Hetzner CX22 · v2 (BHDR)
Coalition masking (200)     0.92 s          0.92 s                 ~5 s                ~5 s
Packed model eval (300)     0.54 s          0.54 s                 ~30 s               ~30 s
Regression (400)            7.70 s          0.60 s                 ~60 s               ~10–15 s†
Total pipeline              9.2 s           2.1 s                  ~95 s               ~45–50 s†

† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4-vCPU AMD EPYC research server, scaled by the empirical ~2× CX22/EPYC ratio. M1 totals omit per-query key regeneration and encrypted-baseline round-trip. The wider d = 50 exp11 run including those overheads over N = 100 UCI Adult queries reports p50 = 12.7 s, p95 = 28.0 s, p99 = 49.0 s, mean = 15.6 s (±7.2 s) on the same host — with the >10 s p50/p95 spread reflecting laptop-class thermal cycling. Warm-cache deployments that reuse the encrypted baseline across queries recover the 2.1 s headline figure.

// OCTE tree-ensemble feasibility study (Hetzner CX22, measured end-to-end)
OCTE · T = 100, D = 4    ~53 s · passes accuracy gate                                  FEASIBLE
OCTE · T = 100, D = 6    fails accuracy gate · noise budget exhausted at N = 2¹⁶       UNREACHABLE

The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths simultaneously using a tanh–Chebyshev sign surrogate at ~5 CKKS levels, path indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys; D = 6 fails the accuracy gate due to noise-budget exhaustion at N = 2¹⁶. A composite minimax sign gate and GPU rotation acceleration (CAT framework reports 68–74×) remain in-tree experimental paths.
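To make the OCTE idea concrete, here is a plaintext sketch of oblivious evaluation of a complete depth-D tree: all 2ᴰ paths are weighted by a soft indicator and summed, so the data-access pattern is independent of the input. The tanh-based soft_step is a stand-in for the paper's tanh–Chebyshev sign surrogate (under CKKS it would be a low-degree polynomial); all names and sizes here are illustrative, not the OCTE implementation:

```python
import numpy as np

def soft_step(t, alpha=8.0):
    # Smooth surrogate for 1[t >= 0]; a stand-in for the CKKS sign gate.
    return 0.5 * (1.0 + np.tanh(alpha * t))

def oblivious_tree_eval(x, feat, thr, leaves, D):
    # Complete depth-D tree in level order: root 0, children 2i+1 / 2i+2.
    # Every leaf's path indicator is a product of soft comparisons, so no
    # branch depends on x -- the 'oblivious' part.
    out = 0.0
    for leaf in range(2 ** D):
        ind, node = 1.0, 0
        for level in range(D):
            go_right = (leaf >> (D - 1 - level)) & 1   # MSB = first split
            s = soft_step(x[feat[node]] - thr[node])
            ind *= s if go_right else (1.0 - s)
            node = 2 * node + 1 + go_right
        out += ind * leaves[leaf]
    return out

# Tiny depth-2 tree: 3 internal nodes, 4 leaves.
x = np.array([0.9, -0.3])
feat = [0, 1, 1]
thr = [0.5, 0.0, 0.0]
leaves = [1.0, 2.0, 3.0, 4.0]
# Hard path: x[0] >= 0.5 -> right, x[1] < 0.0 -> left -> leaf value 3.0.
val = oblivious_tree_eval(x, feat, thr, leaves, D=2)   # close to 3.0
```

The CKKS cost structure follows directly: one sign evaluation per internal node, ⌈log₂ D⌉ levels of indicator products, and plaintext leaf aggregation, which is why depth drives the noise budget.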

The core idea in pseudocode — offline M, BSGS-hoisted matvec (matches Algorithm 1 in the paper)

Once. Up front. Plaintext. Then at query time the server does 51 rotations on a single encrypted vector to produce the full d-dimensional SHAP attribution. Rotation budget breakdown: 5 replication doublings + 31 baby-step + 15 giant-step = 51.

# === Offline, plaintext, run once per (d, K, seed) ===
def precompute_diagonals(d, K, seed, Kp=512, r1=32, r2=16):
    Z = sample_coalitions(d, K, seed) # {0,1}^{K×d}
    W = kernel_weights(Z) # Lundberg–Lee
    M = inv(Z.T @ W @ Z) @ Z.T @ W # d × K, plaintext
    M_pad = pad_columns(M, target=Kp) # K → K′ = 512
    diags = halevi_shoup_diagonals(M_pad) # K′ diagonals
    return pre_rotate_giantstep(diags, r1, r2) # p'[j,i], tiled K′-periodic

# === Online, per query, on encrypted enc(y) — Algorithm 1 ===
def bhdr_regression(enc_y, pdiags, r1=32, r2=16, Kp=512, n=16384):
    # Step 1: K′-periodic replication — tile enc(y) with period K′
    # via ⌈log₂(n/K′)⌉ = 5 rotate-and-double operations.
    enc_y_rep = enc_y
    for s in range(5): # offsets K′, 2K′, 4K′, 8K′, 16K′
        enc_y_rep = enc_y_rep + EvalRotate(enc_y_rep, Kp << s)

    # Step 2: baby-step precompute (level-1 hoisting — shared digits).
    digits = EvalFastRotationPrecompute(enc_y_rep)

    # Step 3: baby-step rotations, kept in the extended CRT basis
    # (level-2 hoisting); i = 0 is the identity, so 31 real rotations.
    ct_ext = [EvalFastRotationExt(enc_y_rep, i, digits)
              for i in range(r1)] # i = 0, ..., r1−1

    # Step 4: giant-step accumulation in extended basis,
    # then KeySwitchDown per giant step.
    acc = [None] * r2
    for j in range(r2):
        acc_j = sum(pdiags[j][i] * ct_ext[i] for i in range(r1))
        acc[j] = KeySwitchDown(acc_j)

    # Step 5: giant-step rotation and sum; j = 0 needs no rotation,
    # so 15 real EvalRotate calls.
    result = sum(EvalRotate(acc[j], j * r1) for j in range(r2))

    # Step 6: rescale — 1 multiplicative depth level consumed.
    return Rescale(result) # enc(φ), 5 + 31 + 15 = 51 rotations total
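Step 1's rotate-and-double tiling can be checked in plaintext (np.roll standing in for EvalRotate; toy slot counts rather than the paper's n = 16384, K′ = 512):

```python
import numpy as np

# Tile a K'-slot vector across n slots with log2(n/K') rotate-and-double
# passes, mirroring Step 1 of the online algorithm.
n, Kp = 64, 8
y = np.zeros(n)
y[:Kp] = np.arange(1, Kp + 1)      # enc(y) occupies the first K' slots

rep = y.copy()
for s in range(int(np.log2(n // Kp))):   # offsets K', 2K', 4K', ...
    rep = rep + np.roll(rep, Kp << s)    # rotate-and-add doubles the copies

# The result is K'-periodic: every block of K' slots holds the same data,
# which is what lets d output coordinates share one set of rotations.
assert np.allclose(rep, np.tile(y[:Kp], n // Kp))
```

Each pass doubles the number of filled blocks, so ⌈log₂(n/K′)⌉ passes suffice: 5 at the paper's parameters, matching the replication term in the 51-rotation budget.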

Cite This Work

// BibTeX
@misc{alissaei2026bhdr,
  author       = {Bader Alissaei},
  title        = {{BHDR}: {BSGS}-Hoisted Diagonal Regression for
                   Non-Interactive Single-Server Kernel {SHAP} under {CKKS}},
  year         = {2026},
  note         = {Available on request from b@vaultbytes.com},
}