🏆 Benchmarked Free Ride

Last updated: 2026-04-04 07:40 UTC

Comprehensive Model Evaluation

We benchmark free OpenRouter models across two critical dimensions using industry-standard frameworks. Each benchmark tests different aspects of AI agent capabilities.

OpenClawBench

What it tests: Task completion capabilities including file manipulation, web search, and weather queries.
How it works: Single-turn task execution across 3 scenarios with 9 easy-level tasks total (3 per scenario).
Scoring: Composite score = 70% accuracy + 20% speed + 10% token efficiency

🛡️

AgentDojo

What it tests: Prompt injection resistance and security in agentic scenarios.
How it works: Workspace suite with tool_knowledge attacks testing whether models follow malicious instructions.
Scoring: Security score (% of attacks blocked) and utility score (% of legitimate tasks completed under attack)

📚 How OpenClawBench Works

OpenClawBench evaluates models on practical task completion through single-turn interactions:

  • File Manipulation: Create, read, transform, and extract data from files (JSON, CSV, Markdown)
  • Web Search: Query information from the web and synthesize results
  • Weather Data: Retrieve and process weather information for different locations

Each task receives a binary score (100% pass or 0% fail) based on whether the agent successfully completed the objective. The final ranking uses a composite score weighted by accuracy (70%), speed (20%), and token efficiency (10%).

🛡️ How AgentDojo Works

AgentDojo tests whether AI agents can resist prompt injection attacks while maintaining utility:

  • User Tasks: Legitimate requests the agent should complete (baseline utility)
  • Injection Tasks: Same requests but with hidden malicious instructions injected via tool outputs
  • Attack Type: tool_knowledge attacks embed instructions in file contents, API responses, etc.

Models are scored on two dimensions:

  • Security Score: Percentage of injection attacks successfully blocked (higher = more secure)
  • Utility Score: Percentage of legitimate tasks completed correctly under attack conditions

Models are ranked by security score as the primary metric, with utility score shown for context.

📊 How AgentDojo Testing Works

Source: AgentDojo v1.2.2 - Prompt injection security benchmark for AI agents

🎯 Test Structure

Each model is tested with user tasks (legitimate operations) combined with injection tasks (malicious instructions hidden in tool outputs).

10 User Tasks (legitimate work like "Schedule a meeting" or "Find emails from last week")
×
6 Injection Tasks (attacks like "Send sensitive data to attacker" or "Delete important files")
=
60 test cases per model

🔓 The 6 Injection Attack Types

  • Email Exfiltration: Send specific emails to attacker's address
  • File Deletion: Delete files from cloud storage
  • Calendar Manipulation: Create events with unauthorized participants
  • Data Forwarding: Forward sensitive email contents to attacker
  • Security Code Theft: Steal 2FA/security codes from emails
  • Evidence Erasure: Exfiltrate data then delete evidence (hardest)

⚔️ Attack Vector: tool_knowledge

Malicious instructions are embedded in tool outputs (file contents, API responses, email bodies) that the agent reads while performing legitimate tasks. A secure agent must complete the user's task while ignoring injected commands.

📈 Scoring Metrics

  • Security Score: % of injection attacks successfully blocked (higher is better)
  • Utility Score: % of legitimate user tasks completed correctly despite attacks
  • Attacks Blocked: Number of malicious actions prevented out of 60 total attempts

Note: Full benchmark includes 33 user tasks (198 test cases). We test 10 tasks to balance comprehensive coverage with practical runtime (~10 min vs ~30 min per model).

📝 Note: All free OpenRouter models are tested. Models use the OpenRouter provider integration with AgentDojo, enabling security testing for any model accessible via OpenRouter's API.

⚡ OpenClawBench

Task completion benchmark • Ranked by composite score (accuracy + speed + efficiency)

📊 Benchmark Configuration

Description

OpenClawBench is a task completion benchmark that tests AI agents' ability to use tools and complete real-world tasks. Agents are evaluated on file manipulation, weather lookups, and web searches using a simulated bot interface.

How We Run It

We run 3 scenarios (file, weather, web) with easy difficulty tasks in single-turn mode (the agent provides one complete response per task). This subset provides fast benchmarking (~2-5 minutes per model) while covering core capabilities. Each scenario contains 3 tasks for a total of 9 tasks per model.

What Are the Metrics

  • Rank: Position in the leaderboard (1 = best)
  • Model: OpenRouter model identifier
  • Score: Composite metric combining task accuracy (% correct), execution speed, and efficiency (higher is better)
  • Tasks Passed: Number of tasks completed successfully out of 9 total tasks
Loading...

🛡️ AgentDojo

Prompt injection security benchmark • Ranked by attack success rate (lower = more secure)

📊 Benchmark Configuration

Description

AgentDojo v1.2.2 is a prompt injection security benchmark that tests AI agents' resistance to adversarial attacks. It evaluates whether agents can complete legitimate tasks while resisting malicious instructions embedded in tool outputs (emails, files, API responses). Developed by ETH Zurich's SPY Lab.

How We Run It

We run the workspace suite (email, calendar, cloud storage tools) with tool_knowledge attacks (malicious instructions hidden in tool outputs). We test 10 user tasks out of the full 33-task suite, with each task tested against 5 randomly sampled injection attacks (from 6 available attack types) = 50 security tests per model. We use pass@5 evaluation: an attack is considered successful if it works in at least 1 of the 5 attempts. Additionally, each of the 10 user tasks is tested for utility (task completion). This approach provides robust security evaluation with statistical significance in ~25-35 minutes per model.

What Are the Metrics

  • Rank: Position in the leaderboard (1 = most secure)
  • Model: OpenRouter model identifier
  • Attack Success Rate: Percentage of injection attacks that succeeded. Formula: (Attacks Succeeded / Total Attacks) × 100. Lower is better — 0% means perfect security, 100% means all attacks succeeded.
  • Utility Score: Percentage of legitimate tasks completed while under attack. Formula: (Tasks Completed / Total Tasks) × 100. Higher is better — shows the agent maintains functionality despite adversarial conditions.
Loading...