VNN-COMP: Benchmarking the Verification of Neural Networks
A competition that's building trust in AI
In my Benchmarking Primer, I argued that benchmarking is a method for measuring, comparing, and improving machine learning systems. In Formal Verification of ML, I discussed how verification enables us to mathematically prove (or disprove) that a machine learning model always satisfies certain critical properties, such as robustness, safety, and fairness, for all possible valid inputs, not just the ones tested. Together, they address two key aspects of the assurance puzzle: benchmarking reveals how well a system performs, while verification confirms what it guarantees.
But there’s still a critical question: who verifies the verifiers?
That’s where the Verification of Neural Networks Competition (VNN-COMP) comes in. Think of VNN-COMP as where benchmarking discipline meets formal verification guarantees. Each year, the competition brings together verification tools from around the world and tests them under the same conditions: identical models, identical specifications, and identical compute budgets. The result is a clear, comparative picture of how today’s verification techniques perform in practice, cutting through theory and marketing claims to see which tools actually deliver.
A short history of VNN-COMP (2020-2025)
From a friendly kickoff to a flagship global competition, VNN-COMP has grown into the central benchmark for neural network verification. The following section covers key milestones per year, and a comparison table provides a high-level overview of the competition from its inception to the present day.
2020
An inaugural "friendly competition" to build the community. Results were self-reported and run on non-standardised hardware. Three benchmark categories were considered, spanning the ACAS Xu, MNIST, and CIFAR-10 datasets.
2021
Rules were formalised, official rankings were introduced, and ONNX and VNN-LIB became the standard model and specification formats. Scoring was strict but straightforward: +10 points for each correctly verified SAT/UNSAT instance and a –100 penalty for any incorrect result (a minimal sketch of this arithmetic follows below). To ensure fairness, every tool ran on equal-cost AWS hardware, with teams choosing between a CPU-optimised and a GPU-optimised instance. Six benchmark categories were considered, spanning the ACAS Xu, MNIST, and CIFAR-10 datasets, plus one benchmark on database indexing.
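To make that scoring rule concrete, here is a minimal Python sketch of the 2021 arithmetic. It is an illustration only, assuming a simple per-instance record of what each tool reported and whether it was correct; the function and field names are mine, not part of the official VNN-COMP scoring scripts.

```python
# Minimal sketch of 2021-style scoring: +10 for each correctly solved
# SAT/UNSAT instance, -100 for each incorrect result. "unknown" (e.g. a
# timeout) earns nothing and costs nothing. Names are illustrative.

def score_tool(results):
    """results: list of dicts like
    {"reported": "sat" | "unsat" | "unknown", "correct": True | False}"""
    total = 0
    for r in results:
        if r["reported"] == "unknown":
            continue                      # no reward, no penalty
        total += 10 if r["correct"] else -100
    return total


if __name__ == "__main__":
    demo = [
        {"reported": "unsat", "correct": True},    # proof accepted:        +10
        {"reported": "sat", "correct": True},      # valid counter-example: +10
        {"reported": "sat", "correct": False},     # bogus counter-example: -100
        {"reported": "unknown", "correct": False}, # timeout:                 0
    ]
    print(score_tool(demo))  # -> -80
```

The steep asymmetry is the point: one wrong answer wipes out ten correct ones, so tools are pushed to answer "unknown" rather than guess.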
2022
A fully automated evaluation pipeline ran on AWS. Each benchmark had a total timeout of between three and six hours, randomisation of instances became mandatory, and tools could be tuned per benchmark. The benchmarks spanned six categories. As in the previous year, teams could choose from a range of AWS instance types focused on CPU, GPU, or a mix of both.
2023
Continued the standardised ONNX/VNN-LIB formats and equal-cost AWS evaluation, keeping the automated pipeline and a standardised counter-example format. Counter-example (CE) "witnesses" (concrete inputs that show a reported counter-example is real) became mandatory; a replay sketch follows below. Broader AWS instance choices were offered, the time bonus was removed, and the penalty for an incorrect result rose to –150.
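To illustrate what a CE witness buys you, here is a rough Python sketch of the replay idea: load the ONNX model, run the reported witness input through it, and check that the output really violates the claimed property. The model path, input shape, classification-style property, and use of onnxruntime here are illustrative assumptions, not the official VNN-COMP checking harness.

```python
# Sketch of the CE-witness idea: a reported counter-example must come with a
# concrete input that can be replayed against the ONNX model to confirm the
# property really is violated. Paths, shapes, and the property are placeholders.

import numpy as np
import onnxruntime as ort


def witness_violates_property(model_path, witness, target_label):
    """Replay the witness and check the claimed violation.

    Assumed property: 'the network classifies every input in the region as
    target_label'. The counter-example is genuine only if the replayed output
    picks a different class. (A full check would also confirm the witness
    lies inside the input bounds given in the VNN-LIB specification.)"""
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    output = sess.run(None, {input_name: witness.astype(np.float32)})[0]
    return int(np.argmax(output)) != target_label


if __name__ == "__main__":
    # Hypothetical 28x28 MNIST-style witness reported by a verification tool.
    witness = np.zeros((1, 1, 28, 28), dtype=np.float32)
    print(witness_violates_property("model.onnx", witness, target_label=7))
```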
2024
Introduced regular and extended tracks while continuing the ONNX/VNN-LIB formats and equal-cost AWS evaluation. The installation and evaluation pipeline was fully automated, CE witnesses remained required, and scoring again had no time bonus and a –150 penalty for incorrect results.
2025
The official VNN-COMP 2025 report isn’t out yet, but Safe Intelligence was there. Read our takeaways 👇🏼
Let's wrap up with a quick look back at that history.
From the friendly kickoff in 2020 to today's flagship status, VNN-COMP has become the rigorous, automated proving ground for neural network verification itself. It's where theory meets practice and where trust in AI has to prove itself, because benchmarking alone isn't enough, and verification without verification is a paradox.