OpenAI’s EVMbench: The Latest Comedy of Errors in Crypto Security

In an astonishing turn of events that could only be concocted by the most imaginative of novelists, OpenAI has unveiled EVMbench, a whimsical contraption aimed at testing AI agents on smart contract security. This follows the rather unfortunate incident in which code co-authored by Claude Opus 4.6 helped trigger an eye-watering $1.78 million DeFi exploit.

Now, smart contracts are like the knights of yore, valiantly guarding over $100 billion in assets held in open-source crypto protocols. One would think such a hefty sum would warrant a bit more scrutiny, and indeed, OpenAI’s latest caper is capturing the attention of the masses, if only to chuckle at the absurdity of it all. Collaborating with crypto investment firm Paradigm, they’ve rolled out EVMbench, a benchmark that tests how well AI agents can sniff out, exploit, and patch high-severity smart contract vulnerabilities. Quite the tall order, wouldn’t you say?

But here’s the kicker: this benchmark is not just a random assortment of vulnerabilities. No, it draws from a veritable treasure trove of 120 carefully curated vulnerabilities across 40 audits, most of them sourced from open code-audit competitions. And what sets it apart, you ask? Oh, just the fact that EVMbench tests three distinct capability modes: detect, patch, and exploit, each graded separately through a Rust-based contraption that replays transactions in a sandboxed local environment. Think of it as a playground for AI agents, minus the swings and slides.
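To make the “high-severity vulnerability” idea concrete, here is a minimal, hypothetical sketch of the kind of flaw such a benchmark curates: a textbook reentrancy hole in a withdrawal function. The contract and every name in it are invented for illustration and are not drawn from the actual EVMbench task set.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Hypothetical illustration only; not an actual EVMbench task.
// A textbook reentrancy vulnerability: the external call happens
// before the balance is zeroed, so a malicious receiver can
// re-enter withdraw() and drain the vault.
contract VulnerableVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        require(amount > 0, "nothing to withdraw");

        // Interaction before effect: an attacker's receive()
        // re-enters here while balances[msg.sender] is still stale.
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");

        balances[msg.sender] = 0; // state update arrives too late
    }
}
```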


The Number That Should Worry Everyone

Ah, now we arrive at the crux of the matter. In exploit mode, GPT-5.3-Codex via Codex CLI scored a staggering 72.2%. Just six months ago, our dear old GPT-5 was languishing at a mere 31.9% on the same metric, a jump of more than 40 percentage points in half a year. Not quite a small gap, I must say! OpenAI confirmed these figures in a rather self-congratulatory announcement on X, positioning EVMbench as both a tool for measurement and a clarion call to the security community. How noble!

However, let us not get ahead of ourselves. Detect and patch scores are still playing catch-up. Agents in detection mode occasionally identify a single vulnerability and then, like a child distracted by a butterfly, simply stop, forgetting that there’s an entire codebase left to explore! And patch mode is akin to balancing on a tightrope while juggling flaming torches: preserving full contract functionality while excising the flaw is proving to be quite the conundrum.
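To see why patch mode is so delicate, consider fixing the hypothetical vault sketched earlier. A passing patch has to close the hole without altering the behavior honest users depend on, which is exactly what the standard checks-effects-interactions reordering achieves. Again, this is an illustration, not an actual EVMbench grading case.

```solidity
// Patched withdraw() for the hypothetical vault above: identical
// behavior for honest callers, but the balance is zeroed BEFORE
// the external call, so a re-entering attacker reads zero.
function withdraw() external {
    uint256 amount = balances[msg.sender];
    require(amount > 0, "nothing to withdraw");

    balances[msg.sender] = 0;                         // effect first

    (bool ok, ) = msg.sender.call{value: amount}(""); // interaction last
    require(ok, "transfer failed");
}
```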


A $1.78M Oracle Error Nobody Caught

Now, let us set the stage for this drama. Security researcher evilcos flagged on X that the DeFi lending protocol Moonwell suffered a catastrophic loss of approximately $1.78 million due to an oracle configuration error. Imagine setting cbETH’s value at $1.12 instead of the correct figure of about $2,200. Such a blunder belongs in a farce, not a financial protocol!

This low-level mistake should have been caught by any half-decent audit. Yet, the GitHub pull request for proposal MIP-X43 revealed commits co-authored by none other than Claude Opus 4.6, Anthropic’s pride and joy at the time. Oh, the irony!
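To give a flavor of how small such a slip can look in code, here is a deliberately oversimplified, hypothetical oracle wrapper. It is not Moonwell’s actual code, and the constant, decimals convention, and function name are all invented.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Hypothetical illustration only; not Moonwell's code.
// A lending market that trusts this oracle will misvalue every
// cbETH position: priced at $1.12 instead of ~$2,200, healthy
// positions can suddenly look undercollateralized and be
// liquidated at a steep discount, or the collateral math simply
// breaks, depending on how the protocol consumes the feed.
contract MisconfiguredOracle {
    // Intended: ~2200e8 (8-decimal USD price convention).
    // Configured: 1.12e8. One constant, seven figures of damage.
    uint256 public constant CBETH_PRICE = 1.12e8;

    function getUnderlyingPrice(address /* asset */) external pure returns (uint256) {
        return CBETH_PRICE;
    }
}
```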

Smart contract auditor pashov took to X to point out what may very well be the first exploit tied to vibe-coded Solidity. He was careful to emphasize that human reviewers ultimately hold the keys to the kingdom. A security auditor must sign off before anything goes on-chain, but alas, something in that chain seems to have rusted.

What EVMbench Is Actually Built to Do

EVMbench includes scenarios from the security audit of the Tempo blockchain, which is nothing short of a purpose-built L1 designed for high-throughput stablecoin payments. This extension pushes EVMbench into payment-oriented contract code, an area where OpenAI predicts agentic stablecoin activity will flourish. Or at least that’s the plan!

Each exploit task runs in an isolated Anvil instance (Foundry’s local Ethereum node), where transactions are replayed with the predictability of a bad sitcom rerun. The grading setup restricts unsafe RPC methods and was red-teamed internally to prevent agents from cheating. After all, we wouldn’t want them to gain the upper hand unfairly.
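OpenAI’s grader is a Rust harness, but for intuition, here is roughly what a deterministic exploit replay looks like as a Foundry-style Solidity test. The RPC URL, block number, and profit check below are placeholders, not details from EVMbench.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import "forge-std/Test.sol";

// Illustrative sketch only; EVMbench's real harness is Rust-based.
contract ExploitReplayTest is Test {
    function test_replayExploit() public {
        // Pin the fork to a fixed block against a local Anvil node
        // so every replay sees identical chain state.
        vm.createSelectFork("http://127.0.0.1:8545", 19_000_000);

        address attacker = makeAddr("attacker");
        vm.deal(attacker, 1 ether);

        vm.startPrank(attacker);
        // ... the agent-submitted exploit transactions run here ...
        vm.stopPrank();

        // Grading then reduces to an objective balance check.
        assertGt(attacker.balance, 1 ether, "exploit did not profit");
    }
}
```

Pinning the block number is what makes the sitcom predictable: every agent attempt replays against byte-identical chain state, so a passing exploit is reproducible rather than a fluke of network timing.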

Furthermore, OpenAI is pledging a cool $10 million in API credits to bolster cyber defense efforts, with priority given to open-source software and critical infrastructure. Their trusty security research agent, Aardvark, is expanding its private beta, offering free codebase scanning for widely used open-source projects. All very grand, isn’t it?

The Vibe-Coding Question With Real Stakes

Pashov’s musings on X raised a rather uncomfortable question that many in the DeFi community had been tiptoeing around. When AI writes production Solidity code and humans rubber-stamp it in a frenzy, the review layer becomes alarmingly thin. The Moonwell incident serves as a stark reminder of just how perilously slim that layer can be.

OpenAI has acknowledged that cybersecurity is intrinsically dual-use, and its response is a layered one: safety training, automated monitoring, and access controls around advanced capabilities are all part of the package. But a 72.2% exploit score on a public benchmark is the kind of number that tends to rattle a few cages.

EVMbench’s complete task set, tooling, and evaluation code are now available for public viewing. The goal is to allow researchers to keep tabs on AI cyber capabilities as they evolve and build defenses to match. Whether that pace is swift enough remains a question for the ages-one that no one seems inclined to answer just yet.
