ZKX is our proof compiler, and PrimeIR is the optimization layer beneath it. Instead of hand-tuning each prover backend, this stack compiles proving workloads into optimized primitive kernels across schemes and hardware targets.
The benchmark sections below show the kernel-level gains (FFT, IFFT, MSM, SMCS, LOGUP GKR) and the application-level outcomes (Groth16 prove, zkVM block proof). The payment gateway demo shows what this means in product terms: proof generation fast enough for real user-facing flows.
ZK proving today looks like ML did around 2015 — every scheme has its own hand-tuned prover, and every new (scheme × hardware) combination is a fresh manual optimization round.
ML solved this with XLA. TensorFlow, PyTorch, and JAX stopped emitting hand-tuned CUDA kernels and started emitting an IR that XLA lowers to whatever hardware target you bring. PyTorch + XLA on a GPU often beats hand-written CUDA C++ — the compiler sees the whole computation graph; the manual writer only sees one kernel at a time.
ZKX takes the same path for ZK — and it's not a loose analogy. We build on the same IR infrastructure (StableHLO + HLO) and add a ZK-specific dialect (PrimeIR) on top. Given any proving scheme, ZKX produces optimized proof generation through just-in-time compilation and cluster-level optimization, targeting CPU, GPU, and ASIC from a single codebase.
Both domains hit the same wall: ZK proving and ML training are memory-bound on modern GPUs, so performance depends more on how data moves than on raw arithmetic throughput. The compiler optimizations that paid off for XLA (fusion, layout, scheduling) are the same ones that move the needle here.
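To make the fusion point concrete, here is a toy model in plain Python. It is only a sketch and not ZKX or PrimeIR code: the function names, the traffic counter, and the choice of the KoalaBear prime as modulus are assumptions made for illustration. It computes `a*b + c` over a prime field twice, once as two separate "kernels" that round-trip an intermediate array through memory, and once as a single fused pass, and it counts element-sized memory accesses in each case.

```python
# Toy model of kernel fusion over a prime field (illustrative only; not ZKX code).
P = 2**31 - 2**24 + 1  # KoalaBear prime, used here only as a convenient modulus


def unfused(a, b, c):
    """Compute (a*b + c) mod P as two separate 'kernels'."""
    traffic = 0
    # Kernel 1: read a and b, write the intermediate product back to memory.
    tmp = [(x * y) % P for x, y in zip(a, b)]
    traffic += 3 * len(a)
    # Kernel 2: read the intermediate and c, write the final result.
    out = [(t + z) % P for t, z in zip(tmp, c)]
    traffic += 3 * len(a)
    return out, traffic


def fused(a, b, c):
    """Same computation as one 'kernel'; the product never leaves a register."""
    out = [(x * y + z) % P for x, y, z in zip(a, b, c)]
    traffic = 4 * len(a)  # read a, b, c once each; write out once
    return out, traffic


n = 1 << 10
a = [(3 * i + 1) % P for i in range(n)]
b = [(5 * i + 2) % P for i in range(n)]
c = [(7 * i + 3) % P for i in range(n)]

out_unfused, traffic_unfused = unfused(a, b, c)
out_fused, traffic_fused = fused(a, b, c)

# Both variants must agree; only the memory traffic differs.
assert out_unfused == out_fused
print(f"unfused accesses: {traffic_unfused}, fused accesses: {traffic_fused}")
```

In this model the fused version does the same arithmetic with a third less memory traffic. On a memory-bound workload that reduction is what translates into wall-clock time, which is why a compiler that sees the whole graph can find savings that a per-kernel optimizer cannot.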
Backend kernel improvements from ZKX + PrimeIR, measured on FFT, IFFT, MSM, SMCS, and LOGUP GKR.
rabbitsnark-py · bn254 · gpu · degree 22
| Family | Setup | Baseline | ZKX | Speedup |
|---|---|---|---|---|
| FFT | bn254 · gpu · d22 · Gnark | 28.334 ms | 1.997 ms | 14.2× |
| IFFT | bn254 · gpu · d22 · Gnark | 28.835 ms | 2.066 ms | 14.0× |
| MSM | bn254 · gpu · d22 · Gnark | 56.380 ms | 29.835 ms | 1.9× |
| SMCS | koalabear · gpu · d20 · SP1 | 1.736 ms | 1.321 ms | 1.3× |
| LOGUP GKR | koalabear · gpu · d22 · SP1 | 108.394 ms | 38.438 ms | 2.8× |
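The speedup column is simply the baseline time divided by the ZKX time. The short script below is a sanity check rather than part of any benchmark harness: it recomputes that ratio from the timings in the table and confirms each published figure to one decimal place. The `ROWS` list and the `speedup` helper are names introduced here for illustration.

```python
# Rows copied from the kernel benchmark table:
# (family, baseline_ms, zkx_ms, published_speedup)
ROWS = [
    ("FFT", 28.334, 1.997, 14.2),
    ("IFFT", 28.835, 2.066, 14.0),
    ("MSM", 56.380, 29.835, 1.9),
    ("SMCS", 1.736, 1.321, 1.3),
    ("LOGUP GKR", 108.394, 38.438, 2.8),
]


def speedup(baseline_ms, zkx_ms):
    """Speedup factor: how many times faster ZKX is than the baseline."""
    return baseline_ms / zkx_ms


for family, baseline_ms, zkx_ms, published in ROWS:
    s = speedup(baseline_ms, zkx_ms)
    # Each published figure should match the ratio rounded to one decimal.
    assert abs(round(s, 1) - published) < 1e-9, family
    saved_ms = baseline_ms - zkx_ms
    print(f"{family:10s} {s:5.1f}x  ({saved_ms:.3f} ms saved per call)")
```

Reading the absolute savings alongside the ratios is useful: the FFT and IFFT rows have the largest speedup factors, while LOGUP GKR saves the most milliseconds per call.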
Two end-to-end workloads drive most production GPU prover cost: Groth16 STARK→SNARK wrapping and zkVM block proving. In each category, ZKX is faster than the prover currently regarded as state of the art.
These are the two provers the ZK industry currently calls the fastest in their categories. ZKX is faster than both on the same workload and the same hardware, with no protocol changes.
Production GPU Groth16 stack — Gnark (Consensys) wrapping Ingonyama's ICICLE GPU primitives. What every SP1 mainnet deployment uses today for the STARK→SNARK wrapping step.
"ICICLE-snark is now the fastest Groth16 prover implementation"
Succinct Labs' current production prover for SP1 — the published frontier for end-to-end zkVM block proof latency. What every Succinct mainnet deployment runs today to generate Ethereum block proofs.
See how fast a real-world action can land on-chain. Star a GitHub repo → ZKX generates the proof → Solana settles the payout, all in seconds.
A Solana program that gates payments on a ZK proof of an off-chain attestation. AI-agent payouts, verified on-chain.