OpenAI has partnered with Paradigm, a leading crypto investment firm, to introduce EVMbench, a benchmark for evaluating how well AI agents can detect, patch, and exploit high-severity vulnerabilities in smart contracts.
The release marks a notable step in assessing AI capabilities in economically critical environments: smart contracts currently secure more than $100 billion in open-source crypto assets.
Comprehensive Vulnerability Assessment
EVMbench is built from a set of 120 curated vulnerabilities drawn from 40 distinct security audits, many of them sourced from open code-audit competitions hosted on platforms such as Code4rena.
The benchmark also incorporates scenarios from the security audit of the Tempo blockchain, a Layer 1 platform built for high-throughput stablecoin payments. This broadens EVMbench's coverage to payment-focused smart contract code, an area where stablecoin transaction volume is expected to grow.
Three Modes of Evaluation
The EVMbench framework assesses AI agents’ competencies across three specific modes: detect, patch, and exploit. Each mode addresses a unique phase in the lifecycle of smart contract security.
In detect mode, agents are evaluated on their ability to audit a repository and recall known vulnerabilities. Patch mode requires agents to fix flawed contracts while preserving their intended functionality, verified through automated testing. Exploit mode challenges agents to carry out end-to-end fund-draining attacks in a controlled, sandboxed blockchain environment.
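To make the three modes concrete, the grading criteria described above could be sketched as follows. This is a hypothetical Python illustration, not EVMbench's actual (Rust-based) harness; the class, field names, and thresholds are assumptions for exposition.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of one agent run against one benchmark task."""
    reported: set = field(default_factory=set)  # vuln IDs the agent reported (detect)
    tests_pass: bool = False       # patched contract still passes its functional tests
    poc_blocked: bool = False      # the known proof-of-concept exploit no longer works
    funds_drained: float = 0.0     # fraction of target funds extracted (exploit mode)


def grade_detect(result: TaskResult, known_vulns: set) -> float:
    """Detect mode: recall over the known vulnerabilities in the repository."""
    if not known_vulns:
        return 0.0
    return len(result.reported & known_vulns) / len(known_vulns)


def grade_patch(result: TaskResult) -> bool:
    """Patch mode: the fix must block the exploit AND preserve intended behavior."""
    return result.poc_blocked and result.tests_pass


def grade_exploit(result: TaskResult) -> bool:
    """Exploit mode has the clearest objective: the target funds are fully drained."""
    return result.funds_drained >= 1.0
```

The asymmetry between the modes is visible even in this toy form: exploit mode reduces to a single binary check, while detect mode is scored as recall over a ground-truth set, so stopping after one finding caps the achievable score.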
To ensure reproducibility, OpenAI has developed a Rust-based harness that deploys contracts deterministically and restricts unsafe RPC methods. All exploit tasks run against an isolated local Anvil instance, never against live networks.
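One way such RPC restrictions might look in practice is an allowlist filter in front of the sandboxed node. The sketch below is a hypothetical Python illustration: the method names are standard Ethereum and Anvil/Hardhat JSON-RPC methods, but the harness's real filtering logic and allowlist are not public.

```python
# Hypothetical JSON-RPC filter for a sandboxed node: reject any method that
# could tamper with chain state out-of-band (e.g. Anvil's cheat-code namespace),
# and otherwise permit only an explicit allowlist of read/transact calls.
BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_", "debug_")
ALLOWED_METHODS = {
    "eth_call",
    "eth_sendRawTransaction",
    "eth_getBalance",
    "eth_getCode",
    "eth_blockNumber",
    "eth_getTransactionReceipt",
}


def is_allowed(request: dict) -> bool:
    """Return True only if the JSON-RPC request's method passes the allowlist."""
    method = request.get("method", "")
    if method.startswith(BLOCKED_PREFIXES):
        return False
    return method in ALLOWED_METHODS
```

Blocking the `anvil_`/`evm_` namespaces matters because those cheat-code methods (balance overrides, time manipulation, impersonation) would let an agent "drain" funds without a genuine exploit, invalidating the benchmark's signal.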
Performance and Future Outlook
Initial results from EVMbench show significant variation in performance across different task types. In exploit mode, the GPT‑5.3‑Codex model achieved a remarkable 72.2% score, a dramatic improvement from its predecessor, GPT‑5, which scored 31.9% just six months earlier.
While agents excel at exploit tasks, which have a clear objective, detect and patch modes prove harder. Agents often stop after identifying a single vulnerability, and they struggle to fix subtle flaws without breaking existing contract functionality.
OpenAI acknowledges that EVMbench does not fully capture the complexity of real-world smart contract security. In particular, the current grading system cannot distinguish genuine vulnerabilities from false positives when agents report findings beyond those documented by human auditors.
In conjunction with EVMbench’s release, OpenAI has allocated $10 million in API credits through its Cybersecurity Grant Program to promote defensive security research, with a focus on open-source software and critical infrastructure. Furthermore, the company has announced the expansion of Aardvark, its security research agent, now available through a private beta program. EVMbench’s tasks, tools, and evaluation framework are publicly accessible to support ongoing research into AI-driven cybersecurity capabilities.
