BHDR: BSGS-Hoisted Diagonal Regression for Non-Interactive Single-Server Kernel SHAP under CKKS

Alissaei, Bader

doi:10.5281/zenodo.19889993

← All Research

[Research Publication — April 2026]

BHDR: A 43× Rotation Reduction for Encrypted Kernel SHAP

BHDR = BSGS-Hoisted Diagonal Regression · BSGS = Baby-Step Giant-Step

BHDR computes Kernel SHAP attributions under CKKS fully homomorphic encryption on a single server, non-interactively. The client's input, the prediction, and the explanation all stay encrypted on the server. The closing weighted-least-squares regression was 84% of the pipeline cost at 7.7s. Observation: the regression matrix M = (ZᵀWZ)⁻¹ ZᵀW depends only on public sampling parameters, so we precompute it offline in plaintext, then evaluate M · enc(y) with a BSGS-hoisted diagonal matvec over a K′-periodic replicate encoding. Rotations drop from ~2,200 to 51 at d = 50. Regression time drops to 0.6s. The full pipeline is 4.4× faster.

// Threat model: honest-but-curious single-server adversary. Server follows the protocol, attempts to learn from ciphertexts. Integrity against malicious servers and output privacy against curious decryption oracles are treated in companion work (Freivalds/LCV-IPA; IND-CPA^D flooding).
// Technical report on Zenodo (v2, latest): 10.5281/zenodo.19889993. Reference implementation distributed separately under AGPL-3.0; request artifact access for review.

[read_technical_report] [cite_this_work] [request_artifact_access]

Author Bader Alissaei

Date April 2026

Affiliation VaultBytes Innovations Ltd

Topic FHE · CKKS · Kernel SHAP · Explainability · Privacy-Preserving ML

Keywords BSGS · Diagonal Matvec · SIMD Packing · Matrix Bernstein · OCTE

What We Built

GDPR requires personal data to stay protected during processing. The EU AI Act introduces transparency obligations (Article 13) for high-risk systems and explanation rights (Article 86) in specified circumstances. A regulated-ML stack may need both privacy-preserving inference and feature-level explanations; BHDR explores one technical path for combining them.

Why regulated ML may want Shapley under FHE

GDPR, HIPAA, and CCPA require personal data to be protected during processing. The EU AI Act adds transparency obligations under Article 13 for high-risk systems, plus explanation rights under Article 86 for decisions affecting natural persons in specified circumstances; high-risk obligations take effect from August 2026. Computing SHAP on plaintext keeps the data visible at the server. Running standard FHE inference protects the data but produces only predictions, not attributions. BHDR is one technical path that handles both stages of the pipeline under encryption; whether it satisfies any specific compliance regime is a legal determination that depends on the deployment context.

Why no prior system did this

We surveyed the Nguyen et al. 2024 privacy-preserving XAI review, IACR ePrint, and the proceedings of CCS, S&P, USENIX Security, NeurIPS, and ICML through early 2026, and did not find a single-server FHE construction that computes Kernel SHAP end-to-end without decrypting the input. XorSHAP (MPC) and gradient-based FHE attribution exist; Shapley-based FHE explanations do not. The reason is cost. A natural port of Kernel SHAP's closing weighted-least-squares regression consumes around 2,200 ciphertext rotations per query at d = 50 on our production binding path, because every output feature wants its own dot product over K packed slots. That puts the regression at multiple seconds per query, dominating the rest of the pipeline.

Offline-precomputed regression matrix

Kernel SHAP's closing step is φ = M · y, where M = (ZᵀWZ)⁻¹ ZᵀW, and both Z and W are public: Z is the coalition sampling matrix, W holds the kernel weights. M depends on no secret. We compute it once, offline, in plaintext. At query time only the coalition outputs y are encrypted, and the regression becomes a plaintext-matrix × encrypted-vector product. HESV (VLDB 2023) made a similar observation in the MPC setting. Our contribution is combining it with a BSGS-hoisted diagonal matvec and K′-periodic replicate encoding so that the matvec works on a single server under CKKS.

BSGS-hoisted diagonal matvec

We encode M on the diagonals of a K′-periodic replicated slot layout (K′ = 512 ≥ K = 390) and evaluate M · enc(y) with a Baby-Step Giant-Step split of r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting: one shared KeySwitchDown for the whole group. Total rotation budget: 31 baby-step, 15 giant-step, 5 replication doublings, for 51 amortized keys. That is 43× below the production BSGS baseline, and 381× below the textbook d(K−1) bound.

// Summary

The expensive part of Kernel SHAP is (ZᵀWZ)⁻¹ ZᵀW. Both Z and W are public. Neither depends on the input, the prediction, or the user. We do the inversion once, in plaintext, before any ciphertext exists. The server then runs a single matrix-vector product per query: a known plaintext matrix against an encrypted vector. We evaluate that product with BSGS-hoisted diagonals so that d output coordinates share one group of rotations instead of each paying for its own. Public-parameter offloading alone gets you the offline cost. The hoisted matvec is what drops online cost from seconds to sub-second.

// BHDR pipeline — where each step runs

Offline (plaintext)

Public params

d, K, seed

→

Sample Z, W

coalitions + weights

→

M = (ZᵀWZ)⁻¹ ZᵀW
plaintext pre-compute

→

Diagonals of M

K′-periodic packing

Client (secret key)

Input x

clipped to [−R, R]ᵈ

→

CKKS encrypt

enc(x), enc(baseline)

→

Send to server

ciphertext only

Server (honest-but-curious)

Coalition mask

Z · enc(x)

→

Model eval

f(enc(·)) per coalition

→

BHDR matvec
51 rotations total

→

enc(φ)

SHAP ciphertext

Client (secret key)

Decrypt enc(φ)

recover SHAP values

→

Display

per-feature attribution

Offline (public, one-shot) Client (holds secret key) Server (ciphertext only)

// Rotation budget for the closing regression (d = 50, K = 390, K′ = 512)

Naïve bound

d × (K−1)

one row per output

→

19,450 rotations

unshipped worst case

v1 production

Row-wise BSGS

per-output batch

→

~2,200 rotations

7.70 s on M1 ⚠

BHDR

Hoisted diagonals

r₁ = 32, r₂ = 16

→

51 rotations

0.60 s on M1 ✓

43× reduction

vs. production

The Numbers

43×

Rotation reduction
vs. v1 production BSGS

Total rotations
for full d = 50 regression

0.6 s

BHDR regression step
(Apple M1)

13.4 s

Deployed end-to-end p50
(N_q = 300 UCI Adult, M1)

0 / 300

Silent CKKS overflows
post input-guard fix (Wilson UB 1.26%)

// Pipeline latency breakdown — Apple M1, d = 50, K = 390 (lower is better)

v1 naïve regression

0.92

0.54

7.70 s regression

9.2 s

v2 BHDR regression

0.92

0.54

0.60

2.1 s

Coalition masking (200) Packed model eval (300) v1 regression (400) BHDR regression (400)

The BHDR-hoisted diagonal regression collapses the regression step from ~84% of the v1 pipeline (7.70 s) to ~29% of the v2 pipeline (0.60 s). Coalition masking and model evaluation are unchanged by BHDR; the bottleneck now sits in the model-evaluation stage, setting up the next optimisation target. Overall pipeline speedup: 4.4×.

// End-to-end encrypted pipeline (d = 50, K = 390, K′ = 512, 128-bit CKKS)

Component	Apple M1 · v1	Apple M1 · v2 (BHDR)	2 vCPU x86 · v1	2 vCPU x86 · v2 (BHDR)
Coalition masking (200)	0.92 s	0.92 s	~5 s	~5 s
Packed model eval (300)	0.54 s	0.54 s	~30 s	~30 s
Regression (400)	7.70 s	0.60 s	~60 s	~10–15 s†
Total pipeline	9.2 s	2.1 s	~95 s	~45–50 s†

† Projected from 5.3 s (mean) / 5.9 s (p95) BHDR regression measured on a 4 vCPU AMD EPYC research server, scaled by the empirical ~2× x86/EPYC ratio. The 9.2 s → 2.1 s totals are a narrow-d, warm-cache algorithm-level ablation (key setup amortized, encrypted baseline reused). The deployed end-to-end pipeline at d = 50 over N_q = 300 UCI Adult queries on Apple M1 measures p50 = 13.4 s, p95 = 16.3 s, p99 = 24.2 s, mean = 14.1 s (±4.0 s), with 0/300 silent CKKS overflows after the input-guard fix (Wilson 95% upper bound 1.26%) and 0 MB resident-memory growth.

// OCTE tree-ensemble feasibility study (measured end-to-end)

OCTE · T = 100, D = 4

~53 s on a 2 vCPU x86 cloud server · passes accuracy gate

DEPLOYED

OCTE · T = 100, D = 6

806 s on an 8-core AMD cloud server · L_∞ = 7.9 × 10⁻³

DIAGNOSTIC

The Oblivious Coalition Tree Evaluator (OCTE) evaluates all 2ᴰ tree paths in parallel. It uses a tanh–Chebyshev sign surrogate at about 5 CKKS levels, path-indicator products at ⌈log₂ D⌉ levels, and plaintext leaf aggregation. D = 4 deploys. D = 6 is a diagnostic — it passes the SHAP accuracy gate at L_∞ = 7.9 × 10⁻³ but is not deployed: it consumes all available multiplicative levels and the wall-clock latency is too high. The binding constraint at D = 6 is latency and depth margin, not accuracy. A composite minimax sign gate and GPU rotation acceleration (the CAT framework reports 68–74× speedups on RTX 4090) are experimental paths we have not deployed.

The regression matrix M = (ZᵀWZ)⁻¹ ZᵀW depends only on the public coalition sampling matrix Z and kernel weights W. It carries no input, prediction, or secret. Inversion runs once, in plaintext, before any ciphertext exists.
BSGS split is r₁ = 32 baby steps and r₂ = 16 giant steps. Baby-step rotations use EvalFastRotationExt with level-2 hoisting and no per-rotation KeySwitchDown. Giant-step rotations use EvalRotate, which is about 2.7× more expensive per call, but there are only 15 of them. Total: 31 baby + 15 giant + 5 replication doublings = 51 amortized rotation keys.
Coalition count scales as K = O(d log d). The deployed sampler is a seeded deterministic stratified-ramp design with antithetic complement pairing, giving K = 390 at d = 50. For d ≤ 10 we enumerate exactly (K = 2ᵈ − 2).
The approximation analysis includes a conservative i.i.d. Matrix Bernstein baseline. Under bounded residuals alone, it has the same 1/√K rate shape as a per-coordinate Hoeffding union bound, up to a √(log d) polylog slack — we do not claim a √d improvement over Hoeffding. Tight Bernstein analysis of the deployed stratified-ramp sampler is open work. The deployed guarantee in production is the release-time certificate, not this loose analytical envelope.
Element-wise error between FHE and plaintext SHAP on the same input is at most 1.35 × 10⁻⁴ (max 7.8 × 10⁻⁵ over N_q = 300). The axiom delta between the two paths is 1.65 × 10⁻⁵. After redistribution, the efficiency-axiom residual is 2.2 × 10⁻¹⁶ (machine epsilon) on both paths.
OpenFHE parameters: N = 2¹⁵, L = 10, 128-bit security for logistic regression. OCTE uses N = 2¹⁶ for the deeper modulus chain required by the comparison polynomials. The end-to-end FHE benchmark over N_q = 300 UCI Adult queries reports 0/300 silent CKKS overflows after the input-guard fix (Wilson 95% UB 1.26%), 0 MB resident-memory growth, and 9/300 SDK input-clip events (3%).
Confidentiality holds against an honest-but-curious server: the server observes ciphertexts only. Input, prediction, and explanation all stay encrypted. Output privacy (IND-CPA^D flooding) and integrity against a malicious server (Freivalds / LCV-IPA) are handled in companion work.
OCTE at D = 4, T = 100 runs in about 53 s end-to-end on single-threaded CPU OpenFHE (2 vCPU x86 cloud server) and is the deployed configuration. D = 6 is a passing diagnostic at 806 s on an 8-core AMD cloud server with L_∞ = 7.9 × 10⁻³; it consumes all 31 available multiplicative levels and is constrained by latency and depth margin, not accuracy. We do not claim MLP support in this paper.
Companion work covers two items out of scope here: (a) the Kernel SHAP efficiency-axiom-vs-ranking-stability trade-off introduced by post-hoc residual redistribution, which applies to any Kernel SHAP deployment, encrypted or not, and (b) the cryptographic integrity layer against a malicious server. This paper is the systems and approximation result only.

Pseudocode — offline matrix, BSGS-hoisted online matvec (matches Algorithm 1 in the paper)

The offline function runs once per (d, K, seed). The online function runs per query on a single encrypted vector and uses 51 rotations total, partitioned as 5 replication doublings + 31 baby-step + 15 giant-step.

/* === Offline, plaintext, run once per (d, K, seed) === */

def precompute_diagonals(d, K, seed, Kp=512, r1=32, r2=16):

    Z = sample_coalitions(d, K, seed)            # {0,1}^{K×d}

    W = kernel_weights(Z)                        # Lundberg–Lee

    M = inv(Z.T @ W @ Z) @ Z.T @ W               # d × K, plaintext

    M_pad = pad_columns(M, target=Kp)            # K → K′ = 512

    diags = halevi_shoup_diagonals(M_pad)        # K′ diagonals

    return pre_rotate_giantstep(diags, r1, r2) # p'[j,i], tiled K′-periodic

/* === Online, per query, on encrypted enc(y) — Algorithm 1 === */

def bhdr_regression(enc_y, pdiags, r1=32, r2=16, Kp=512, n=16384):

    # Step 1: K′-periodic replication — tile enc(y) with period K′

    #         via ⌈log₂(n/K′)⌉ = 5 rotate-and-double operations.

    enc_y_rep = enc_y

    for s in range(5):                        # offsets K′, 2K′, 4K′, 8K′, 16K′

        enc_y_rep = enc_y_rep + EvalRotate(enc_y_rep, Kp << s)

    # Step 2: baby-step precompute (level-1 hoisting — shared digits).

    digits = EvalFastRotationPrecompute(enc_y_rep)

    # Step 3: baby-step rotations — 31 fast rotations,

    #         kept in extended CRT basis (level-2 hoisting).

    ct_ext = [EvalFastRotationExt(enc_y_rep, i, digits)

              for i in range(r1)]    # i = 0, ..., r1−1

    # Step 4: giant-step accumulation in extended basis,

    #         then KeySwitchDown per giant step.

    acc = [None] * r2

    for j in range(r2):

        acc_j = sum(pdiags[j][i] * ct_ext[i] for i in range(r1))

        acc[j] = KeySwitchDown(acc_j)

    # Step 5: giant-step rotation and sum — 15 EvalRotate calls.

    result = sum(EvalRotate(acc[j], j * r1) for j in range(r2))

    # Step 6: rescale — 1 multiplicative depth level consumed.

    return Rescale(result)    # enc(φ), 5 + 31 + 15 = 51 rotations total