BHDR = BSGS-Hoisted Diagonal Regression · BSGS = Baby-Step Giant-Step
Kernel SHAP under CKKS fully homomorphic encryption: end-to-end, non-interactive, single-server. The server never sees the client's input, the prediction, or the explanation in plaintext. The closing weighted-least-squares regression, previously 84% of the pipeline at 7.7 s, drops to 0.6 s by computing the regression matrix M = (ZᵀWZ)⁻¹ZᵀW offline (it depends only on public sampling parameters) and evaluating M · enc(y) via a BSGS-hoisted diagonal matvec on a K′-periodic replicate encoding. Rotations fall from ~2,200 to 51 at d = 50: a 43× reduction, and a 4.4× speedup on the full pipeline.
// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPAD flooding).
// Status: pre-print. Systems companion to the efficiency-axiom numerical-correction paper. Full paper available on request; ePrint posting upon submission. Reference implementation and exp11 harness available under NDA; artifact-evaluation release in preparation.
Two regulatory regimes — protect personal data, and explain consequential decisions — are on a collision course. BHDR resolves the conflict.
GDPR, HIPAA, and CCPA require personal data to be protected during processing. The EU AI Act (Article 13 transparency; Article 86 right to explanation) requires high-risk systems to provide feature-level explanations for consequential decisions, effective August 2026. Plaintext SHAP on personal data violates the first. Standard FHE inference satisfies the first but cannot produce attributions. Today no off-the-shelf pipeline does both.
We are not aware of any prior single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input (we surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, ICML through early 2026). XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost: a natural port of Kernel SHAP's closing weighted-least-squares regression consumes ~2,200 ciphertext rotations per query at d = 50 on the production binding path, since every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the entire pipeline.
Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW and Z, W are the public coalition sampling matrix and kernel weights. M depends on nothing secret — so we compute it once, offline, in plaintext. At encryption time only the encrypted coalition outputs y remain on the FHE side, turning an expensive encrypted linear system into a plaintext-matrix × encrypted-vector product. Prior MPC-based Shapley constructions (HESV, VLDB 2023) have separately exploited the public structure of SHAP regression weights; our contribution is the combination with BSGS-hoisted diagonal matvec and K′-periodic replicate encoding, which together unlock single-server FHE deployment.
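The offline step can be sketched in plaintext NumPy. This is an illustrative reconstruction, not the reference implementation: Z is drawn as a random binary coalition matrix, the kernel weights are random placeholders, and the Kernel SHAP efficiency constraint is omitted for brevity.

```python
import numpy as np

# Offline precomputation: M = (Zᵀ W Z)⁻¹ Zᵀ W. Both Z (coalition sampling
# matrix) and w (kernel weights) are public, so M is computed once in
# plaintext. Sizes match the paper (K = 390, d = 50); the values are
# random placeholders for illustration.
rng = np.random.default_rng(1)
K, d = 390, 50
Z = rng.integers(0, 2, size=(K, d)).astype(float)     # public coalition matrix
w = rng.random(K) + 1e-3                              # public kernel weights
M = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T * w)  # d x K, done offline

# Online: a single plaintext-matrix x encrypted-vector product, phi = M @ enc(y).
y = rng.standard_normal(K)   # stands in for the encrypted coalition outputs
phi = M @ y

# Sanity check: phi solves the WLS problem min_phi sum_i w_i (z_i . phi - y_i)^2.
sw = np.sqrt(w)
phi_ref, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * y, rcond=None)
assert np.allclose(phi, phi_ref)
```

The check confirms that the plaintext matvec M · y reproduces the weighted-least-squares solution exactly, so nothing is lost by moving the inversion off the FHE side.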
We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step (BSGS) split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting — one shared KeySwitchDown for the whole group. The total rotation budget collapses to 51 amortized keys: 31 baby-step, 15 giant-step, 5 replication doublings. A 43× reduction against the production baseline; 381× against a naïve d(K−1) bound.
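A plaintext model of the BSGS diagonal matvec illustrates where the rotation savings come from. Here `np.roll` stands in for CKKS slot rotation; hoisting is a key-switching optimisation with no plaintext analogue, so the sketch only counts the ciphertext rotations that hoisting amortises. Sizes are toy values (the paper uses K′ = 512 with r₁ = 32, r₂ = 16).

```python
import numpy as np

def rot(v, k):
    """Cyclic rotation by k slots (models EvalRotate on packed slots)."""
    return np.roll(v, -k)

def bsgs_diag_matvec(M, v, r1, r2):
    """Compute M @ v via generalized diagonals, using (r1-1) baby-step and
    (r2-1) giant-step ciphertext rotations instead of n-1."""
    n = r1 * r2
    assert M.shape == (n, n) and v.shape == (n,)
    # Generalized diagonals: diags[i][t] = M[t, (t+i) % n].
    diags = [np.array([M[t, (t + i) % n] for t in range(n)]) for i in range(n)]
    baby = [rot(v, j) for j in range(r1)]        # r1-1 rotations (j = 0 is free)
    rotations = r1 - 1
    out = np.zeros(n)
    for k in range(r2):
        inner = np.zeros(n)
        for j in range(r1):
            # Pre-rotating the *plaintext* diagonal by -k*r1 is free; one
            # giant-step rotation of the inner sum then fixes the alignment.
            inner += rot(diags[k * r1 + j], -k * r1) * baby[j]
        out += rot(inner, k * r1)                # k = 0 is free
        rotations += int(k > 0)
    return out, rotations

rng = np.random.default_rng(0)
r1, r2 = 4, 4                                    # toy split; deployment uses 32 x 16
n = r1 * r2
M = rng.standard_normal((n, n))
v = rng.standard_normal(n)
out, rotations = bsgs_diag_matvec(M, v, r1, r2)
assert np.allclose(out, M @ v)
assert rotations == (r1 - 1) + (r2 - 1)          # 31 + 15 = 46 at deployment sizes
```

At deployment sizes this yields the 31 + 15 rotations quoted above; the remaining 5 of the 51-key budget are the replication doublings that build the K′-periodic layout, which this toy square-matrix sketch does not model.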
The expensive part of Kernel SHAP — computing (ZᵀWZ)⁻¹ ZᵀW — uses only the coalition sampling matrix and the kernel weights. Both are public. Neither depends on the input, the model output, or the user's identity. So we do the whole matrix inversion once, in plaintext, before any ciphertext exists. At query time the server only needs a single matrix-vector product: a known plaintext matrix times an encrypted vector. We then evaluate that product with BSGS-hoisted diagonals, where d output coordinates share one set of rotations instead of each paying for its own. Two independent ideas — public-parameter offloading and structured matvec — compose into a pipeline speedup that makes the entire construction deployable.
BHDR collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged; the bottleneck now sits in the model-evaluation stage, the next optimisation target. Overall pipeline speedup: 4.4×.
| Component | Apple M1 · v1 | Apple M1 · v2 (BHDR) | Hetzner CX22 · v1 | Hetzner CX22 · v2 (BHDR) |
|---|---|---|---|---|
| Coalition masking (200) | 0.92 s | 0.92 s | ~5 s | ~5 s |
| Packed model eval (300) | 0.54 s | 0.54 s | ~30 s | ~30 s |
| Regression (400) | 7.70 s | 0.60 s | ~60 s | ~10–15 s† |
| Total pipeline | 9.2 s | 2.1 s | ~95 s | ~45–50 s† |
† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4-vCPU AMD EPYC research server, scaled by the empirical ~2× CX22/EPYC ratio. M1 totals omit per-query key regeneration and encrypted-baseline round-trip. The wider d = 50 exp11 run including those overheads over N = 100 UCI Adult queries reports p50 = 12.7 s, p95 = 28.0 s, p99 = 49.0 s, mean = 15.6 s (±7.2 s) on the same host — with the >10 s p50/p95 spread reflecting laptop-class thermal cycling. Warm-cache deployments that reuse the encrypted baseline across queries recover the 2.1 s headline figure.
The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths simultaneously using a tanh–Chebyshev sign surrogate at ~5 CKKS levels, path indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys; D = 6 fails the accuracy gate due to noise-budget exhaustion at N = 2¹⁶. A composite minimax sign gate and GPU rotation acceleration (CAT framework reports 68–74×) remain in-tree experimental paths.
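A toy plaintext model shows the OCTE structure: evaluate every root-to-leaf path and aggregate with leaf values, so no branch is ever taken in the clear. A plain tanh stands in for the paper's tanh–Chebyshev sign surrogate, and the tree (feature indices, thresholds, leaves) is a random illustrative instance, not the deployed model.

```python
import numpy as np

D = 4                                            # deployment depth from the text
rng = np.random.default_rng(2)
n_internal = 2**D - 1
feat = rng.integers(0, 10, n_internal)           # feature index per internal node
thr = rng.standard_normal(n_internal)            # threshold per internal node
leaf = rng.standard_normal(2**D)                 # plaintext leaf values
x = rng.standard_normal(10)

def soft_step(u, alpha=50.0):
    """Smooth surrogate for the step function; stands in for the
    tanh-Chebyshev sign gate evaluated under CKKS."""
    return 0.5 * (1.0 + np.tanh(alpha * u))

def path_indicators(x, step):
    """Indicator (soft or hard) for each of the 2^D root-to-leaf paths."""
    ind = np.ones(2**D)
    for path in range(2**D):
        node = 0
        for depth in range(D):
            bit = (path >> (D - 1 - depth)) & 1  # 1 = go right
            s = step(x[feat[node]] - thr[node])
            ind[path] *= s if bit else (1.0 - s)
            node = 2 * node + 1 + bit            # child index in the complete tree
    return ind

hard = path_indicators(x, lambda u: float(u >= 0))
soft = path_indicators(x, soft_step)
# Since s + (1 - s) = 1 at every node, the indicators form an exact
# partition of unity at any surrogate sharpness; leaf aggregation is then
# a plain weighted sum over all 2^D paths.
assert np.isclose(soft.sum(), 1.0) and np.isclose(hard.sum(), 1.0)
output = leaf @ soft                             # all paths evaluated, none revealed
```

The depth cost is visible here too: each path indicator is a product of D gate outputs, which is where the ~5 levels per sign surrogate plus ⌈log₂ D⌉ product levels accumulate, and why D = 6 exhausts the noise budget.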
M = (ZᵀWZ)⁻¹ ZᵀW depends only on the public coalition sampling matrix Z and kernel weights W: no input, no prediction, no secret. Inversion is done once, up front, in plaintext, before any ciphertext exists. At query time the server performs 51 rotations on a single encrypted vector to produce the full d-dimensional SHAP attribution. Baby-step rotations use EvalFastRotationExt with level-2 hoisting and no per-rotation KeySwitchDown; giant-step rotations use standard EvalRotate (~2.7× more expensive, but there are only 15 of them). Rotation budget breakdown: 5 replication doublings + 31 baby-step + 15 giant-step = 51 amortized rotation keys.
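The quoted budgets and speedup ratios are mutually consistent, which a few lines of arithmetic confirm (the ~2,200-rotation baseline and the d(K−1) bound are taken from the text above):

```python
# Rotation budget and speedup ratios from the stated parameters.
d, K = 50, 390                       # features, coalitions
r1, r2, doublings = 32, 16, 5        # baby steps, giant steps, replication doublings
budget = (r1 - 1) + (r2 - 1) + doublings
assert budget == 51

naive_bound = d * (K - 1)            # naive per-feature diagonal bound: d(K-1) = 19,450
baseline = 2200                      # measured production binding path (approximate)
print(round(baseline / budget))      # -> 43  (reduction vs. production baseline)
print(round(naive_bound / budget))   # -> 381 (reduction vs. naive bound)
```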