Introduction: why proof verification benchmarks matter
Zero-knowledge rollups (zkrollups) have emerged as a leading layer-2 scaling paradigm for Ethereum. By moving computation and state off-chain and submitting succinct validity proofs on-chain, zkrollups offer trustless finality with dramatically lower gas costs. However, the practical performance of a zkrollup depends critically on how cheaply and quickly its proofs can be verified. Verification benchmarks quantify this: they directly influence user experience (latency), operational cost (gas per proof), and the feasibility of advanced features such as recursive aggregation.
In production systems, verification benchmarks are not just academic numbers — they determine whether a rollup can handle thousands of transactions per second while keeping fees under one cent. This article provides a methodical breakdown of key verification metrics, the tradeoffs associated with each, and how to interpret benchmark data produced by different proving systems (e.g., Groth16, PLONK, Halo2, STARKs).
Core verification metrics: latency, gas cost, and memory
When evaluating a zkrollup's proof system, three primary benchmarks dominate engineering discussions:
- Verification latency — the wall-clock time required to verify a single proof on-chain. This is typically measured in milliseconds on a standard Ethereum execution client (e.g., Geth or Nethermind). For Groth16, verification latency is often under 5 ms; for STARKs, it can range from 50 ms to several seconds depending on the security parameter.
- Gas cost — the L1 gas consumed by the verification contract. This is the most visible economic metric because it directly translates to the cost per proof paid by the rollup operator. Groth16 verifiers typically consume 250k–400k gas per proof, while PLONK-based verifiers may require 400k–800k gas. STARK verifiers, especially on Ethereum, can exceed 1 million gas per proof.
- Memory and state footprint — the per-verification memory used by the verification contract and client. For embedded verifiers or light clients, high memory usage (e.g., >1 MB per proof) can be a bottleneck. Recursive proofs exacerbate this because they must track multiple intermediate states.
A concrete example: the zkSync Era verifier (based on PLONK with custom gates) reports average verification gas of 380k per block of bundled transactions. This is competitive but still higher than the roughly 280k gas of a typical Groth16 verifier on Ethereum. (Zcash's shielded transactions also use Groth16, but they are verified natively by Zcash nodes rather than metered in EVM gas, so they are not a like-for-like gas comparison.) The difference stems from the larger number of polynomial commitments and openings required by PLONK.
Recursion depth and its effect on benchmarks
Recursive proof composition is a central feature of next-generation zkrollups. By proving that a set of sub-proofs are all valid within a single recursive proof, operators can compress batches of thousands of transactions into one on-chain verification. However, recursion depth introduces non-linear scaling in both prover time and verification benchmarks. For a thorough technical treatment of how recursion depth impacts verification overhead, refer to the discussion on Zkrollup Proof Recursion Depth.
The key benchmarks affected by recursion depth include:
- Verifier circuit size — each recursion layer adds constraints to the verifier circuit, increasing the number of gates. For a depth of 3 (or 4 depending on the Merkle tree structure), the verifier circuit may grow by 20–40% compared to a non-recursive configuration.
- Gas cost per recursive proof — because recursion requires verifying the inner proof's pairing check or polynomial evaluation, the gas cost accumulates. Empirical data from Scroll's implementation shows that a depth-4 recursive proof consumes approximately 1.2x the gas of a single proof, not 4x, due to batching optimizations.
- Proof size — iterative aggregation tends to increase proof size linearly with depth for STARK-based systems, whereas pairing-based systems (Groth16) keep proof size constant. This directly affects data availability cost on L1.
Engineers designing rollup infrastructure should benchmark their specific recursion layout (e.g., tree vs. chain) rather than relying on generic claims. A common pitfall is assuming that a recursion library that works well for depth 2 will scale linearly to depth 8 — in practice, constraint blow-up often forces a re-parameterization of the cryptographic group or the commitment scheme.
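To make the tree-versus-chain distinction concrete, the following sketch counts how many recursion layers or steps are needed to fold a set of base proofs into one. It is a counting model only: the function names and the fan_in parameter (how many child proofs one recursive circuit verifies) are illustrative assumptions, and the model says nothing about constraint counts, which have to be measured per circuit.

```python
def tree_layers(num_proofs: int, fan_in: int = 2) -> int:
    """Layers needed to fold num_proofs base proofs into one root proof
    when each recursive circuit verifies up to fan_in child proofs."""
    if num_proofs < 1:
        raise ValueError("num_proofs must be at least 1")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    layers = 0
    remaining = num_proofs
    while remaining > 1:
        remaining = -(-remaining // fan_in)  # ceiling division
        layers += 1
    return layers


def chain_steps(num_proofs: int) -> int:
    """Sequential steps when each step folds one more base proof
    into a running accumulator proof."""
    if num_proofs < 1:
        raise ValueError("num_proofs must be at least 1")
    return num_proofs - 1


if __name__ == "__main__":
    for n in (8, 64, 1000):
        print(n, tree_layers(n, 2), tree_layers(n, 4), chain_steps(n))
```

In this model a tree layout grows logarithmically with the number of base proofs while a chain grows linearly, which is one reason a layout that behaves well at depth 2 can look very different at depth 8.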
Comparing proving systems: Groth16 vs. PLONK vs. STARK
Each major proving system exhibits a distinct profile of verification benchmarks:
- Groth16 — offers the cheapest verification (one pairing-product check over a small, fixed number of pairings, roughly 250k gas in total), but requires a trusted setup per circuit. Verification latency is under 2 ms on modern hardware. The tradeoff is that any circuit change forces a new setup, making it less flexible for evolving rollup logic.
- PLONK — eliminates the per-circuit trusted setup (only a universal setup is needed). Verification is slower, typically 400k–700k gas, due to batch openings of polynomial commitments. PLONK verifiers also require a larger state size (e.g., 1400–1800 F_p elements stored temporarily).
- STARKs (e.g., Winterfell, StarkWare) — no trusted setup at all and post-quantum security. However, STARK verification on Ethereum is expensive: gas costs can exceed 1.5 million, and proof sizes are large (tens of kilobytes). For layer-2 scenarios, STARKs often use a "STARK-to-SNARK" wrapper to compress verification, introducing extra complexity and latency.
For a practical developer evaluation, the relevant benchmark is not just the raw verification gas but the amortized cost per transaction. A rollup that bundles 500 transactions into one Groth16 proof achieves an amortized verification cost of ~500 gas per tx (250k / 500), which is far below the 21k gas base cost of even a plain ETH transfer on L1 (ERC-20 transfers typically cost more than that). Even with the overhead of recursive aggregation, the amortized cost remains competitive as long as the batching factor is high.
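The amortization arithmetic can be sketched in a few lines. The function name and the optional data_gas_per_tx parameter (a per-transaction data availability cost, which the 250k / 500 figure does not include) are assumptions added for illustration.

```python
def amortized_gas_per_tx(verify_gas: int, batch_size: int,
                         data_gas_per_tx: float = 0.0) -> float:
    """Verification gas spread over a batch, plus an optional
    per-transaction data cost (hypothetical parameter)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return verify_gas / batch_size + data_gas_per_tx


if __name__ == "__main__":
    # The Groth16 example from the text: 250k gas over 500 transactions.
    print(amortized_gas_per_tx(250_000, 500))
    # Same batch with an assumed 300 gas of data cost per transaction.
    print(amortized_gas_per_tx(250_000, 500, data_gas_per_tx=300))
```

Keeping the data cost as a separate term makes it easy to see when verification stops being the dominant part of the per-transaction bill.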
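One important nuance: verification benchmarks measured in isolation (e.g., in a Rust test harness) often differ from on-chain benchmarks because of how Ethereum prices the BN254 precompiles for ECADD, ECMUL, and pairing. For example, since EIP-1108 the ECPAIRING precompile (0x08) costs 45k gas base + 34k per pairing pair. A standard Groth16 verifier checks a product of four pairings in one call, yielding about 181k gas for the pairing step alone (45k + 4 × 34k); the rest of the budget goes to ECMUL and ECADD operations for the public inputs, calldata, and call overhead. In contrast, a STARK verifier spends most of its gas on a large number of hash invocations. Keccak-256 is cheap per call on the EVM (30 gas plus 6 gas per 32-byte word) and the SHA-256 precompile costs 60 gas plus 12 gas per word, but Poseidon has no precompile and is far more expensive when implemented in contract code. These differences are well-documented but still catch teams off-guard during mainnet deployment.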
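A minimal sketch of the precompile arithmetic, assuming the BN254 prices set by EIP-1108 (ECADD 150 gas, ECMUL 6,000 gas, pairing 45,000 gas plus 34,000 per pair) and a textbook Groth16 verifier that performs one ECMUL and one ECADD per public input followed by a single four-pair pairing check. The function names are illustrative, and the estimate deliberately excludes calldata, memory, and call overhead.

```python
# BN254 precompile prices after EIP-1108.
ECADD_GAS = 150
ECMUL_GAS = 6_000
PAIRING_BASE_GAS = 45_000
PAIRING_PER_PAIR_GAS = 34_000


def pairing_gas(num_pairs: int) -> int:
    """Gas for one ECPAIRING call that checks num_pairs pairs."""
    if num_pairs < 0:
        raise ValueError("num_pairs must be non-negative")
    return PAIRING_BASE_GAS + PAIRING_PER_PAIR_GAS * num_pairs


def groth16_precompile_gas(num_public_inputs: int, num_pairs: int = 4) -> int:
    """Precompile-only estimate for a textbook Groth16 verifier:
    one ECMUL and one ECADD per public input, then one pairing check."""
    if num_public_inputs < 0:
        raise ValueError("num_public_inputs must be non-negative")
    input_gas = num_public_inputs * (ECMUL_GAS + ECADD_GAS)
    return input_gas + pairing_gas(num_pairs)


if __name__ == "__main__":
    print(pairing_gas(4))             # pairing step alone
    print(groth16_precompile_gas(1))  # one public input
```

The gap between this precompile-only figure and a measured on-chain total is the overhead that isolated test harnesses tend to miss.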
Practical benchmarking methodology
To obtain reliable verification benchmarks for a zkrollup, follow these steps:
- Choose a proving library, such as bellman (Groth16), halo2 (PLONK variant), or starkware-crypto. Each has optimizations that affect verification.
- Deploy the verifier contract to a local Ethereum testnet (e.g., Hardhat or Anvil). Measure gas using ethers.js or web3 with eth_estimateGas. Run at least 10 iterations to account for node caching.
- Record verification latency using the EVM's block time. In production, if the verifier takes longer than the block interval (12 seconds), the rollup state cannot advance each slot. For sub-12-second verification, test with a single proof and then with a batch of proofs.
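- Profile memory by tracing the verifier call with a debugger and inspecting its peak memory usage. EVM memory expansion is priced quadratically, so a verifier that touches a large memory region pays disproportionately more gas. Note that Solidity's "stack too deep" error is a separate, compile-time limit on local variables rather than a symptom of high memory consumption; it is usually worked around by moving values into memory or structs.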
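Two small helpers can support the measurement steps above. The first summarizes repeated gas readings (however they were collected) so that outliers from node caching are visible; the second applies the EVM memory cost formula, 3 gas per 32-byte word plus words squared divided by 512. Both function names are illustrative assumptions, not part of any library.

```python
from statistics import median


def summarize_gas(samples: list) -> dict:
    """Min, median, max, and spread of repeated gas measurements."""
    if not samples:
        raise ValueError("samples must not be empty")
    lowest, highest = min(samples), max(samples)
    return {
        "min": lowest,
        "median": median(samples),
        "max": highest,
        "spread": highest - lowest,
    }


def memory_expansion_gas(num_bytes: int) -> int:
    """Total memory gas for a call frame whose memory grows to num_bytes."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    words = (num_bytes + 31) // 32  # round up to whole 32-byte words
    return 3 * words + (words * words) // 512


if __name__ == "__main__":
    print(summarize_gas([249_900, 250_000, 250_100]))
    for size in (1_024, 64 * 1_024, 1_024 * 1_024):
        print(size, memory_expansion_gas(size))
```

The quadratic term is negligible for a few kilobytes but dominates at the megabyte scale, which is why per-proof memory is worth tracking alongside gas.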
A critical parameter that many developers overlook is the elliptic curve underlying the commitment scheme. For example, using BLS12-381 (which has a 381-bit base field) vs. BN254 (254-bit) changes pairing costs and proof sizes. BLS12-381 offers higher security but has historically meant larger gas consumption on Ethereum, because the original pairing precompiles (EIP-196 and EIP-197) support only BN254; BLS12-381 precompiles are specified separately in EIP-2537. Always benchmark with the specific elliptic curve your rollup plans to use.
Tradeoffs and optimization strategies
Given the benchmarks described above, engineers face recurring tradeoffs:
- Batch size vs. latency — larger batches reduce amortized gas but take longer to fill and to prove, so users wait longer before their batch is verified on L1. A common sweet spot is 200–800 transactions per batch.
- Recursion depth vs. simplicity — deeper recursion enables higher compression but adds circuit complexity and may increase proof generation time by 10x or more. Some rollups (e.g., Loopring) opt for shallow recursion (depth 2 or 3) to keep prover hardware costs reasonable.
- Gas cost vs. proof finality — STARK-based systems are sometimes credited with faster finality, but on Ethereum a batch is final only once its proof has been verified on L1, so any finality advantage has to be weighed against the higher gas cost per proof. For high-value DeFi applications, faster finality may justify the extra gas.
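The batch-size side of these tradeoffs can be inverted into a simple question: how many transactions must a batch contain before verification gas drops below a per-transaction target? The sketch below answers that for verification gas only; the function name is an assumption, and the sample gas figures are the rough ranges quoted in this article rather than measurements.

```python
def min_batch_size(verify_gas: int, target_gas_per_tx: int) -> int:
    """Smallest batch size for which verify_gas / batch_size
    does not exceed target_gas_per_tx."""
    if verify_gas < 0:
        raise ValueError("verify_gas must be non-negative")
    if target_gas_per_tx < 1:
        raise ValueError("target_gas_per_tx must be at least 1")
    return max(1, -(-verify_gas // target_gas_per_tx))  # ceiling division


if __name__ == "__main__":
    for label, gas in (("Groth16", 250_000), ("PLONK", 400_000), ("STARK", 1_500_000)):
        print(label, min_batch_size(gas, 1_000))
```

A verifier that costs several times more gas needs a proportionally larger batch to reach the same amortized figure, which pushes directly against the latency side of the tradeoff.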
The ecosystem also sees increasing adoption of hardware acceleration (e.g., FPGA-based multiexponentiation) to reduce prover and verifier latency. However, verification benchmarks on Ethereum remain constrained by the EVM's execution model. EIP-4844 (proto-danksharding), live on mainnet since the Dencun upgrade in March 2024, lowers data availability costs rather than verification gas, while precompile improvements for multi-scalar multiplication (MSM) may shift the tradeoffs further. Benchmarks should therefore be re-measured after each such upgrade rather than assumed to hold.
For teams building automated trading strategies on top of zkrollups, understanding these benchmarks is essential for estimating execution costs.
Conclusion
Zkrollup proof verification benchmarks are not a single number but a multidimensional tradeoff space involving latency, gas cost, memory, recursion depth, and proving system choice. For production systems, the most actionable benchmarks are:
- Amortized gas per transaction (target: < 1000 gas on L1).
- Verification latency (target: < 2 seconds per batch).
- Proof size (target: < 1 kB for pairing-based, < 50 kB for STARK-based).
- Recursion overhead (target: < 30% additional constraints per depth).
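These targets can be turned into an automated check in a benchmarking pipeline. The sketch below is one possible shape: the BenchmarkResult fields and the failed_targets function are hypothetical names, and 1 kB is assumed to mean 1,024 bytes.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    amortized_gas_per_tx: float
    verify_latency_s: float
    proof_size_bytes: int
    recursion_overhead_pct: float  # extra constraints per recursion depth
    pairing_based: bool            # True for Groth16/PLONK, False for STARKs


def failed_targets(result: BenchmarkResult) -> list:
    """Names of the targets listed above that the result misses."""
    max_proof_bytes = 1_024 if result.pairing_based else 50 * 1_024
    checks = {
        "amortized_gas": result.amortized_gas_per_tx < 1_000,
        "latency": result.verify_latency_s < 2.0,
        "proof_size": result.proof_size_bytes < max_proof_bytes,
        "recursion_overhead": result.recursion_overhead_pct < 30.0,
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    sample = BenchmarkResult(500.0, 0.005, 256, 25.0, pairing_based=True)
    print(failed_targets(sample))
```

Returning the list of missed targets, rather than a single pass or fail flag, makes it clearer which dimension of the tradeoff space needs attention.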
By methodically benchmarking their specific circuit and batching strategy, developers can avoid costly deployment surprises and optimize for their target use case — whether that is high-throughput payment settlement, decentralized exchange order books, or privacy-preserving transfers. The field is evolving rapidly, and staying informed about benchmark data from production verifiers (e.g., zkSync, Scroll, StarkNet) is a practical necessity for any engineering team building on zkrollups.