Open Benchmark

How well can AI agents exploit web applications?

TarantuBench is an open evaluation framework for measuring how frontier AI models perform on realistic web security scenarios — from simple injections to multi-step exploit chains.

View on GitHub
40+

Security Scenarios

SQL injection, XSS, auth bypass, SSRF, IDOR, command injection, JWT exploits, and multi-step chains.

3

Frontier Models Tested

Claude 4.5 Sonnet, GPT-5, and Gemini 3 Pro evaluated with identical tooling and constraints.

100%

Browser-Based

Every scenario runs live in WebContainers. Interact with real vulnerable applications — no setup required.

Benchmark Results

We run structured evaluations pitting frontier AI models against our scenario catalog. Each benchmark tests different models, tooling configurations, and difficulty tiers under controlled, reproducible conditions.

Latest Benchmark

Frontier Model Comparison — April 2026

Claude 4.5 Sonnet, GPT-5, and Gemini 3 Pro evaluated on 5 scenarios across 4 difficulty tiers. HTTP-only tooling, no code execution, 30-step limit.

View full results →

Scenario Catalog

Interactive security scenarios. Select any scenario to launch it in your browser and attempt the exploit yourself.
