← All Research
[Research Publication — April 2026]

BHDR: A 43× Rotation Reduction for Encrypted Kernel SHAP

BHDR = BSGS-Hoisted Diagonal Regression  ·  BSGS = Baby-Step Giant-Step

BHDR computes Kernel SHAP attributions under CKKS fully homomorphic encryption on a single server, non-interactively. The client's input, the prediction, and the explanation all stay encrypted on the server. The closing weighted-least-squares regression was 84% of the pipeline cost at 7.7s. Observation: the regression matrix M = (ZᵀWZ)⁻¹ ZᵀW depends only on public sampling parameters, so we precompute it offline in plaintext, then evaluate M · enc(y) with a BSGS-hoisted diagonal matvec over a K′-periodic replicate encoding. Rotations drop from ~2,200 to 51 at d = 50. Regression time drops to 0.6s. The full pipeline is 4.4× faster.

// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPAD flooding).
// Technical report on Zenodo (v2, latest): 10.5281/zenodo.19889993. Reference implementation distributed separately under AGPL-3.0; request artifact access for review.

Author Bader Alissaei
Date April 2026
Affiliation VaultBytes Innovations Ltd
Topic FHE · CKKS · Kernel SHAP · Explainability · Privacy-Preserving ML
Keywords BSGS · Diagonal Matvec · SIMD Packing · Matrix Bernstein · OCTE

What We Built

GDPR requires personal data to stay protected during processing. The EU AI Act introduces transparency obligations (Article 13) for high-risk systems and explanation rights (Article 86) in specified circumstances. A regulated-ML stack may need both privacy-preserving inference and feature-level explanations; BHDR explores one technical path for combining them.

01

Why regulated ML may want Shapley under FHE

GDPR, HIPAA, and CCPA require personal data to be protected during processing. The EU AI Act adds transparency obligations under Article 13 for high-risk systems, plus explanation rights under Article 86 for decisions affecting natural persons in specified circumstances; high-risk obligations take effect from August 2026. Computing SHAP on plaintext keeps the data visible at the server. Running standard FHE inference protects the data but produces only predictions, not attributions. BHDR is one technical path that handles both stages of the pipeline under encryption; whether it satisfies any specific compliance regime is a legal determination that depends on the deployment context.

02

Why no prior system did this

We surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, and ICML through early 2026, and did not find a single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input. XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost. A natural port of Kernel SHAP's closing weighted-least-squares regression consumes around 2,200 ciphertext rotations per query at d = 50 on our production binding path, because every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the rest of the pipeline.

03

Offline-precomputed regression matrix

Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW, and both Z and W are public: Z is the coalition sampling matrix, W holds the kernel weights. M depends on no secret. We compute it once, offline, in plaintext. At query time only the coalition outputs y are encrypted, and the regression becomes a plaintext-matrix × encrypted-vector product. HESV (VLDB 2023) made a similar observation in the MPC setting. Our contribution is combining it with a BSGS-hoisted diagonal matvec and K′-periodic replicate encoding so that the matvec works on a single server under CKKS.

04

BSGS-hoisted diagonal matvec

We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting: one shared KeySwitchDown for the whole group. Total rotation budget: 31 baby-step, 15 giant-step, 5 replication doublings, for 51 amortized keys. That is 43× below the production BSGS baseline, and 381× below the textbook d(K−1) bound.

// Summary

The expensive part of Kernel SHAP is (ZᵀWZ)⁻¹ ZᵀW. Both Z and W are public. Neither depends on the input, the prediction, or the user. We do the inversion once, in plaintext, before any ciphertext exists. The server then runs a single matrix-vector product per query: a known plaintext matrix against an encrypted vector. We evaluate that product with BSGS-hoisted diagonals so that d output coordinates share one group of rotations instead of each paying for its own. Public-parameter offloading alone gets you the offline cost. The hoisted matvec is what drops online cost from seconds to sub-second.

// BHDR pipeline — where each step runs
Offline (plaintext)
Public params
d, K, seed
Sample Z, W
coalitions + weights
M = (ZᵀWZ)⁻¹ ZᵀW
plaintext pre-compute
Diagonals of M
K′-periodic packing

Client (secret key)
Input x
clipped to [−R, R]ᵈ
CKKS encrypt
enc(x), enc(baseline)
Send to server
ciphertext only

Server (honest-but-curious)
Coalition mask
Z · enc(x)
Model eval
f(enc(·)) per coalition
BHDR matvec
51 rotations total
enc(φ)
SHAP ciphertext

Client (secret key)
Decrypt enc(φ)
recover SHAP values
Display
per-feature attribution
Offline (public, one-shot) Client (holds secret key) Server (ciphertext only)
// Rotation budget for the closing regression (d = 50, K = 390, K′ = 512)
Naïve bound
d × (K−1)
one row per output
19,450 rotations
unshipped worst case

v1 production
Row-wise BSGS
per-output batch
~2,200 rotations
7.70 s on M1 ⚠

BHDR
Hoisted diagonals
r₁ = 32, r₂ = 16
51 rotations
0.60 s on M1 ✓
=
43× reduction
vs. production

The Numbers

43×
Rotation reduction
vs. v1 production BSGS
51
Total rotations
for full d = 50 regression
0.6 s
BHDR regression step
(Apple M1)
13.4 s
Deployed end-to-end p50
(Nq = 300 UCI Adult, M1)
0 / 300
Silent CKKS overflows
post input-guard fix (Wilson UB 1.26%)
// Pipeline latency breakdown — Apple M1, d = 50, K = 390 (lower is better)
v1 naïve regression
0.92
0.54
7.70 s regression
9.2 s
v2 BHDR regression
0.92
0.54
0.60
2.1 s
Coalition masking (200) Packed model eval (300) v1 regression (400) BHDR regression (400)

The BHDR-hoisted diagonal regression collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged by BHDR; the bottleneck now sits in the model-evaluation stage, setting up the next optimisation target. Overall pipeline speedup: 4.4×.

// End-to-end encrypted pipeline (d = 50, K = 390, K′ = 512, 128-bit CKKS)
Component Apple M1 · v1 Apple M1 · v2 (BHDR) 2 vCPU x86 · v1 2 vCPU x86 · v2 (BHDR)
Coalition masking (200) 0.92 s 0.92 s ~5 s ~5 s
Packed model eval (300) 0.54 s 0.54 s ~30 s ~30 s
Regression (400) 7.70 s 0.60 s ~60 s ~10–15 s†
Total pipeline 9.2 s 2.1 s ~95 s ~45–50 s†

† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4 vCPU AMD EPYC research server, scaled by the empirical ~2× x86/EPYC ratio. The 9.2 s → 2.1 s totals are a narrow-d, warm-cache algorithm-level ablation (key setup amortized, encrypted baseline reused). The deployed end-to-end pipeline at d = 50 over Nq = 300 UCI Adult queries on Apple M1 measures p50 = 13.4 s, p95 = 16.3 s, p99 = 24.2 s, mean = 14.1 s (±4.0 s), with 0/300 silent CKKS overflows after the input-guard fix (Wilson 95% upper bound 1.26%) and 0 MB resident-memory growth.

// OCTE tree-ensemble feasibility study (measured end-to-end)
OCTE · T = 100, D = 4
~53 s on a 2 vCPU x86 cloud server · passes accuracy gate
DEPLOYED
OCTE · T = 100, D = 6
806 s on an 8-core AMD cloud server · L = 7.9 × 10⁻³
DIAGNOSTIC

The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths in parallel. It uses a tanh–Chebyshev sign surrogate at about 5 CKKS levels, path-indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys. D = 6 is a diagnostic — it passes the SHAP accuracy gate at L = 7.9 × 10⁻³ but is not deployed: it consumes all available multiplicative levels and the wall-clock latency is too high. The binding constraint at D = 6 is latency and depth margin, not accuracy. A composite minimax sign gate and GPU rotation acceleration (the CAT framework reports 68–74× speedups on RTX 4090) are experimental paths we have not deployed.

Pseudocode — offline matrix, BSGS-hoisted online matvec (matches Algorithm 1 in the paper)

The offline function runs once per (d, K, seed). The online function runs per query on a single encrypted vector and uses 51 rotations total, partitioned as 5 replication doublings + 31 baby-step + 15 giant-step.

/* === Offline, plaintext, run once per (d, K, seed) === */
def precompute_diagonals(d, K, seed, Kp=512, r1=32, r2=16):
    Z = sample_coalitions(d, K, seed) # {0,1}^{K×d}
    W = kernel_weights(Z) # Lundberg–Lee
    M = inv(Z.T @ W @ Z) @ Z.T @ W # d × K, plaintext
    M_pad = pad_columns(M, target=Kp) # K → K′ = 512
    diags = halevi_shoup_diagonals(M_pad) # K′ diagonals
    return pre_rotate_giantstep(diags, r1, r2) # p'[j,i], tiled K′-periodic

/* === Online, per query, on encrypted enc(y) — Algorithm 1 === */
def bhdr_regression(enc_y, pdiags, r1=32, r2=16, Kp=512, n=16384):
    # Step 1: K′-periodic replication — tile enc(y) with period K′
    # via ⌈log₂(n/K′)⌉ = 5 rotate-and-double operations.
    enc_y_rep = enc_y
    for s in range(5): # offsets K′, 2K′, 4K′, 8K′, 16K′
        enc_y_rep = enc_y_rep + EvalRotate(enc_y_rep, Kp << s)

    # Step 2: baby-step precompute (level-1 hoisting — shared digits).
    digits = EvalFastRotationPrecompute(enc_y_rep)

    # Step 3: baby-step rotations — 31 fast rotations,
    # kept in extended CRT basis (level-2 hoisting).
    ct_ext = [EvalFastRotationExt(enc_y_rep, i, digits)
              for i in range(r1)] # i = 0, ..., r1−1

    # Step 4: giant-step accumulation in extended basis,
    # then KeySwitchDown per giant step.
    acc = [None] * r2
    for j in range(r2):
        acc_j = sum(pdiags[j][i] * ct_ext[i] for i in range(r1))
        acc[j] = KeySwitchDown(acc_j)

    # Step 5: giant-step rotation and sum — 15 EvalRotate calls.
    result = sum(EvalRotate(acc[j], j * r1) for j in range(r2))

    # Step 6: rescale — 1 multiplicative depth level consumed.
    return Rescale(result) # enc(φ), 5 + 31 + 15 = 51 rotations total

Cite This Work

Technical report archived on Zenodo (latest, v2): 10.5281/zenodo.19889993.

// BibTeX @misc{alissaei2026bhdr,
  author     = {Bader Alissaei},
  title      = {{BHDR}: {BSGS}-Hoisted Diagonal Regression for
               Non-Interactive Single-Server Kernel {SHAP} under {CKKS}},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19889993},
  url       = {https://doi.org/10.5281/zenodo.19889993}
}

// Patent and software notices

Patent notice. The methods described in this technical report are related to pending PCT patent application PCT/IB2026/053405, held by VaultBytes Innovations Ltd. The report is licensed under CC BY 4.0 for text, figures, and tables only. No patent license or implementation right is granted by publication of this report. Implementation or commercial use of patented or patent-pending methods may require a separate license.

Software notice. The reference implementation is distributed separately under AGPL-3.0. The software license applies to the code repository only; the publication license for the report applies only to the report text, figures, and tables.