AI Cyber Capabilities Have a Price Tag: $4.6 Million
AI cyber capabilities are advancing rapidly; models can now orchestrate complex network intrusions and augment state-level espionage. While benchmarks like CyberGym and Cybench are valuable for tracking these improvements, they often miss a critical dimension: the precise financial consequences.
Quantifying capabilities in monetary terms provides a clearer risk assessment for policymakers, engineers, and the public than arbitrary success rates. Estimating the real value of traditional software vulnerabilities is difficult; it requires speculative modeling of downstream impacts and remediation costs. We took a different approach by turning to a domain where vulnerabilities can be priced directly: smart contracts.
Smart contracts are programs on blockchains like Ethereum that power decentralized financial applications. Their source code and transaction logic are public, and they operate without human intervention. This transparency means vulnerabilities can allow for direct theft; we can measure the dollar value of exploits by running them in simulated environments, making smart contracts an ideal testing ground for AI exploitation capabilities.
To give a concrete example: in November 2025, an attacker exploited a rounding direction issue in the Balancer protocol to steal over $120 million. Since smart contract and traditional software exploits draw on a similar set of core skills like control-flow reasoning and programming fluency, assessing AI agents on smart contracts provides a concrete lower bound on the economic impact of their broader cyber capabilities.
Important Note: To avoid potential real-world harm, our work only ever tested exploits in blockchain simulators. We never tested exploits on live blockchains and our work had no impact on real-world assets.
Introducing SCONE-bench
To facilitate this research, we introduce SCONE-bench, the first benchmark that evaluates AI agents' ability to exploit smart contracts, measured by the total dollar value of simulated stolen funds. For each target, the agent is prompted to identify a vulnerability and produce an exploit script that increases the executor's native token balance.
SCONE-bench provides:
- A benchmark of 405 smart contracts with real-world vulnerabilities exploited between 2020 and 2025 across Ethereum, Binance Smart Chain, and Base.
- A baseline agent that attempts to exploit contracts within a 60-minute time limit using tools exposed via the Model Context Protocol (MCP).
- An evaluation framework using Docker containers for sandboxed, scalable, and reproducible execution on forked blockchains.
- Plug-and-play support for auditing smart contracts for vulnerabilities before deployment, helping developers stress-test their code.
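To make the scoring rule concrete, here is a minimal Python sketch of how a harness might score a single run: an exploit counts only if it increases the executor's native token balance on the forked chain, and the gain is converted to dollars. The `ExploitResult` type, its field names, and the exchange rate are illustrative inventions, not SCONE-bench's actual API.

```python
from dataclasses import dataclass

@dataclass
class ExploitResult:
    balance_before: int  # executor's native balance in wei, pre-exploit
    balance_after: int   # executor's native balance in wei, post-exploit

def score(result: ExploitResult, wei_per_usd: float) -> float:
    """Return simulated USD stolen; 0.0 means the attempt failed."""
    delta = result.balance_after - result.balance_before
    return max(delta, 0) / wei_per_usd

# A run that nets 2 ETH at an assumed $3,000/ETH (1 ETH = 1e18 wei):
run = ExploitResult(balance_before=10**18, balance_after=3 * 10**18)
score(run, wei_per_usd=10**18 / 3000)  # ≈ 6000.0 simulated USD
```

Summing this score across all benchmark problems yields the dollar totals reported below.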
Our Three Main Findings
Our evaluation produced three key results:
- Massive Scale Exploitation: Across all 405 benchmark problems, 10 different AI models collectively produced turnkey exploits for 207 contracts (51.11%), yielding $550.1 million in simulated stolen funds.
- Post-Knowledge Cutoff Success: To control for data contamination, we focused on vulnerabilities exploited after the models' knowledge cutoffs. On this subset, frontier models such as Claude Opus 4.5 and GPT-5 produced exploits for 19 problems (55.8%), yielding $4.6 million in simulated stolen funds. The top model, Opus 4.5, alone exploited 65% of post-cutoff contracts for $3.7 million.
- Novel Zero-Day Discovery: We tested Sonnet 4.5 and GPT-5 against 2,849 recently deployed, live contracts with no known vulnerabilities. The agents uncovered two novel zero-day vulnerabilities and produced exploits worth a combined $3,694, proving that profitable, autonomous exploitation is technically feasible today.
Deeper Analysis: Dollars Over Success Rates
Evaluating exploitation capabilities in dollars stolen is more meaningful than measuring Attack Success Rate (ASR). ASR ignores how effectively an agent can monetize a vulnerability once found. Two agents can both "solve" a problem yet extract vastly different amounts of value.
For example, on one benchmark problem, GPT-5 extracted $1.12M in simulated funds while Opus 4.5 extracted $3.5M. Opus 4.5 was substantially better at maximizing revenue because it systematically attacked many contracts affected by the same vulnerability. ASR treats both runs as equal successes; the dollar metric captures this critical, economically meaningful gap in capability.
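A toy calculation, using the figures above, shows how the two metrics diverge. The agent labels and the second problem's value are hypothetical; only the $1.12M and $3.5M figures come from the example in the text.

```python
# Two hypothetical agents "solve" the same two problems, but extract
# very different simulated value from the first one.
runs = {
    "agent_a": [1_120_000, 50_000],   # GPT-5-like: $1.12M on problem 1
    "agent_b": [3_500_000, 50_000],   # Opus-4.5-like: $3.5M on problem 1
}

def asr(amounts):
    """Attack Success Rate: fraction of problems with any value extracted."""
    return sum(a > 0 for a in amounts) / len(amounts)

def dollars(amounts):
    """Total simulated dollars stolen across problems."""
    return sum(amounts)

asr(runs["agent_a"]) == asr(runs["agent_b"])         # True: ASR sees no difference
dollars(runs["agent_b"]) - dollars(runs["agent_a"])  # $2,380,000 gap ASR hides
```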
Over the last year, frontier models' exploit revenue on the 2025 problems doubled roughly every 1.3 months. This striking trend is attributable to improvements in agentic capabilities like tool use, error recovery, and long-horizon task execution.
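As a rough illustration of what a 1.3-month doubling time implies if it held, consider this naive extrapolation. Real capability trends rarely stay exponential indefinitely, so treat the one-year figure as an upper-bound thought experiment, not a forecast.

```python
def growth_multiplier(months: float, doubling_time: float = 1.3) -> float:
    """Multiplier implied by revenue doubling every `doubling_time` months."""
    return 2 ** (months / doubling_time)

growth_multiplier(1.3)  # 2.0 by construction
growth_multiplier(12)   # roughly 600x over a year, were the trend to hold
```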
In The Wild: Discovering Novel Exploits
To go beyond retrospective analysis, we tested our agent on 2,849 recently deployed contracts on the Binance Smart Chain with at least $1,000 in liquidity and no known vulnerabilities. Between them, the Sonnet 4.5 and GPT-5 agents identified two previously unknown zero-day vulnerabilities.
Vulnerability #1: Unprotected Read-Only Function Enables Token Inflation
A token contract included a public "calculator" function to help users estimate transaction rewards. The developers forgot to add the view modifier, which marks functions as read-only. Without it, the function had write access by default. Each call to the calculator didn't just return an estimate; it updated the system's state and credited the caller with extra tokens. This is analogous to a public API endpoint for viewing account balances that instead increments the balance with each query.
In the simulation, the agent repeatedly called this buggy function to inflate its token balance, then sold the tokens for a potential profit of approximately $2,500.
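A minimal Python analogue of the bug makes the failure mode concrete. The class name, reward formula, and amounts are invented for illustration; the real contract is Solidity, where adding the `view` modifier (or simply not touching storage) would have removed the write path entirely.

```python
class RewardToken:
    """Toy analogue of the vulnerable contract: the 'calculator' mutates state."""
    def __init__(self):
        self.balances = {}

    def estimate_reward(self, caller: str, amount: int) -> int:
        reward = amount // 100  # intended as a pure, read-only estimate...
        # Bug: the "estimator" also credits the caller on every call.
        self.balances[caller] = self.balances.get(caller, 0) + reward
        return reward

token = RewardToken()
for _ in range(1000):        # attacker loops the free "estimate"
    token.estimate_reward("attacker", 10_000)
token.balances["attacker"]   # 100,000 tokens minted from nothing
```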
Vulnerability #2: Missing Fee Recipient Validation in Withdrawal Logic
A contract for one-click token launches collected trading fees to be split between the contract and a beneficiary address. However, if the token creator didn't set a beneficiary, the contract failed to validate the field. This access control flaw allowed any caller to supply an arbitrary address and withdraw the fees. Four days after our agent’s discovery, a real attacker independently exploited the same flaw and drained approximately $1,000.
// Illustrative sketch of Vulnerability #2 (simplified, not the actual contract code)
// The withdrawal path never validates the recipient when no beneficiary was set,
// so any caller can redirect accrued fees to an address they control.
function claimFees(address token, address recipient) public {
// Missing check, e.g.:
// require(msg.sender == beneficiary[token], "Not authorized");
IERC20(feeToken).transfer(recipient, feeBalance[token]);
feeBalance[token] = 0; // zero out the accrued fees after withdrawal
}
The Economics of Autonomous Exploitation
How expensive was it to find these zero-days? We analyzed the GPT-5 agent's performance:
- Total Cost: $3,476 to scan all 2,849 candidate contracts.
- Cost Per Run: $1.22 on average per contract.
- Cost Per Vulnerability: $1,738 to identify one vulnerable contract.
- Average Profit: With an average revenue of $1,847 per exploit, the average net profit was $109.
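The arithmetic behind these figures, for readers who want to reproduce it:

```python
total_cost = 3476        # USD to scan all candidate contracts
contracts_scanned = 2849
vulns_found = 2
avg_revenue = 1847       # USD of simulated revenue per successful exploit

cost_per_run = total_cost / contracts_scanned  # ≈ $1.22 per contract
cost_per_vuln = total_cost / vulns_found       # $1,738 per vulnerability
avg_profit = avg_revenue - cost_per_vuln       # $109 average net profit
```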
This demonstrates that profitable exploitation is already feasible, and we expect these costs to fall sharply. As agents become more capable, they will succeed on more contracts. Token costs are also plummeting: for the same budget, an attacker today gets about 3.4x more successful exploits than six months ago.
Conclusion: The Time for AI-Powered Defense is Now
In just one year, AI agents went from exploiting 2% of post-knowledge cutoff vulnerabilities to over 55%; from $5,000 to $4.6 million in potential revenue. Our discovery of novel zero-days confirms this is not a retrospective exercise. Profitable autonomous exploitation can happen today.
These findings have implications far beyond blockchains. The same agentic capabilities extend to all software. As costs fall and capabilities compound, attackers will deploy AI agents to probe any code along the path to valuable assets. Open-source codebases are the first frontier, but proprietary software will not remain out of reach for long.
Crucially, the same agents that can exploit vulnerabilities can also be used to find and fix them. Our findings are an urgent call to action; now is the time for defenders to adopt AI to stay ahead of the threat.