BHDR = BSGS-Hoisted Diagonal Regression · BSGS = Baby-Step Giant-Step
BHDR computes Kernel SHAP attributions under CKKS fully homomorphic encryption on a single server, non-interactively. The client's input, the prediction, and the explanation all stay encrypted on the server. The closing weighted-least-squares regression was 84% of the pipeline cost at 7.7s. Observation: the regression matrix M = (ZᵀWZ)⁻¹ ZᵀW depends only on public sampling parameters, so we precompute it offline in plaintext, then evaluate M · enc(y) with a BSGS-hoisted diagonal matvec over a K′-periodic replicate encoding. Rotations drop from ~2,200 to 51 at d = 50. Regression time drops to 0.6s. The full pipeline is 4.4× faster.
// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPAD flooding).
// Technical report on Zenodo (v2, latest): 10.5281/zenodo.19889993. Reference implementation distributed separately under AGPL-3.0; request artifact access for review.
GDPR requires personal data to stay protected during processing. The EU AI Act introduces transparency obligations (Article 13) for high-risk systems and explanation rights (Article 86) in specified circumstances. A regulated-ML stack may need both privacy-preserving inference and feature-level explanations; BHDR explores one technical path for combining them.
GDPR, HIPAA, and CCPA require personal data to be protected during processing. The EU AI Act adds transparency obligations under Article 13 for high-risk systems, plus explanation rights under Article 86 for decisions affecting natural persons in specified circumstances; high-risk obligations take effect from August 2026. Computing SHAP on plaintext keeps the data visible at the server. Running standard FHE inference protects the data but produces only predictions, not attributions. BHDR is one technical path that handles both stages of the pipeline under encryption; whether it satisfies any specific compliance regime is a legal determination that depends on the deployment context.
We surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, and ICML through early 2026, and did not find a single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input. XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost. A natural port of Kernel SHAP's closing weighted-least-squares regression consumes around 2,200 ciphertext rotations per query at d = 50 on our production binding path, because every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the rest of the pipeline.
Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW, and both Z and W are public: Z is the coalition sampling matrix, W holds the kernel weights. M depends on no secret. We compute it once, offline, in plaintext. At query time only the coalition outputs y are encrypted, and the regression becomes a plaintext-matrix × encrypted-vector product. HESV (VLDB 2023) made a similar observation in the MPC setting. Our contribution is combining it with a BSGS-hoisted diagonal matvec and K′-periodic replicate encoding so that the matvec works on a single server under CKKS.
We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting: one shared KeySwitchDown for the whole group. Total rotation budget: 31 baby-step, 15 giant-step, 5 replication doublings, for 51 amortized keys. That is 43× below the production BSGS baseline, and 381× below the textbook d(K−1) bound.
The expensive part of Kernel SHAP is (ZᵀWZ)⁻¹ ZᵀW. Both Z and W are public. Neither depends on the input, the prediction, or the user. We do the inversion once, in plaintext, before any ciphertext exists. The server then runs a single matrix-vector product per query: a known plaintext matrix against an encrypted vector. We evaluate that product with BSGS-hoisted diagonals so that d output coordinates share one group of rotations instead of each paying for its own. Public-parameter offloading alone gets you the offline cost. The hoisted matvec is what drops online cost from seconds to sub-second.
The BHDR-hoisted diagonal regression collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged by BHDR; the bottleneck now sits in the model-evaluation stage, setting up the next optimisation target. Overall pipeline speedup: 4.4×.
| Component | Apple M1 · v1 | Apple M1 · v2 (BHDR) | 2 vCPU x86 · v1 | 2 vCPU x86 · v2 (BHDR) |
|---|---|---|---|---|
| Coalition masking (200) | 0.92 s | 0.92 s | ~5 s | ~5 s |
| Packed model eval (300) | 0.54 s | 0.54 s | ~30 s | ~30 s |
| Regression (400) | 7.70 s | 0.60 s | ~60 s | ~10–15 s† |
| Total pipeline | 9.2 s | 2.1 s | ~95 s | ~45–50 s† |
† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4 vCPU AMD EPYC research server, scaled by the empirical ~2× x86/EPYC ratio. The 9.2 s → 2.1 s totals are a narrow-d, warm-cache algorithm-level ablation (key setup amortized, encrypted baseline reused). The deployed end-to-end pipeline at d = 50 over Nq = 300 UCI Adult queries on Apple M1 measures p50 = 13.4 s, p95 = 16.3 s, p99 = 24.2 s, mean = 14.1 s (±4.0 s), with 0/300 silent CKKS overflows after the input-guard fix (Wilson 95% upper bound 1.26%) and 0 MB resident-memory growth.
The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths in parallel. It uses a tanh–Chebyshev sign surrogate at about 5 CKKS levels, path-indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys. D = 6 is a diagnostic — it passes the SHAP accuracy gate at L∞ = 7.9 × 10⁻³ but is not deployed: it consumes all available multiplicative levels and the wall-clock latency is too high. The binding constraint at D = 6 is latency and depth margin, not accuracy. A composite minimax sign gate and GPU rotation acceleration (the CAT framework reports 68–74× speedups on RTX 4090) are experimental paths we have not deployed.
M = (ZᵀWZ)⁻¹ ZᵀW depends only on the public coalition sampling matrix Z and kernel weights W. It carries no input, prediction, or secret. Inversion runs once, in plaintext, before any ciphertext exists.EvalFastRotationExt with level-2 hoisting and no per-rotation KeySwitchDown. Giant-step rotations use EvalRotate, which is about 2.7× more expensive per call, but there are only 15 of them. Total: 31 baby + 15 giant + 5 replication doublings = 51 amortized rotation keys.The offline function runs once per (d, K, seed). The online function runs per query on a single encrypted vector and uses 51 rotations total, partitioned as 5 replication doublings + 31 baby-step + 15 giant-step.
Technical report archived on Zenodo (latest, v2): 10.5281/zenodo.19889993.
Patent notice.
The methods described in this technical report are related to pending PCT patent application PCT/IB2026/053405, held by VaultBytes Innovations Ltd. The report is licensed under CC BY 4.0 for text, figures, and tables only. No patent license or implementation right is granted by publication of this report. Implementation or commercial use of patented or patent-pending methods may require a separate license.
Software notice. The reference implementation is distributed separately under AGPL-3.0. The software license applies to the code repository only; the publication license for the report applies only to the report text, figures, and tables.