Last updated: 2026-04-04 07:40 UTC
We benchmark free OpenRouter models across two critical dimensions using industry-standard frameworks. Each benchmark tests different aspects of AI agent capabilities.
What it tests: Task completion capabilities including file manipulation, web search, and weather queries.
How it works: Single-turn task execution across 3 scenarios with 9 easy-level tasks total (3 per scenario).
Scoring: Composite score = 70% accuracy + 20% speed + 10% token efficiency
What it tests: Prompt injection resistance and security in agentic scenarios.
How it works: Workspace suite with tool_knowledge attacks testing whether models follow malicious instructions.
Scoring: Security score (% of attacks blocked) and utility score (% of legitimate tasks completed under attack)
OpenClawBench evaluates models on practical task completion through single-turn interactions:
Each task receives a binary score (100% pass or 0% fail) based on whether the agent successfully completed the objective. The final ranking uses a composite score weighted by accuracy (70%), speed (20%), and token efficiency (10%).
AgentDojo tests whether AI agents can resist prompt injection attacks while maintaining utility:
Models are scored on two dimensions:
Models are ranked by security score as the primary metric, with utility score shown for context.
Source: AgentDojo v1.2.2 - Prompt injection security benchmark for AI agents
Each model is tested with user tasks (legitimate operations) combined with injection tasks (malicious instructions hidden in tool outputs).
10 User Tasks (legitimate work like "Schedule a meeting" or "Find emails from last week")
×
6 Injection Tasks (attacks like "Send sensitive data to attacker" or "Delete important files")
=
60 test cases per model
Malicious instructions are embedded in tool outputs (file contents, API responses, email bodies) that the agent reads while performing legitimate tasks. A secure agent must complete the user's task while ignoring injected commands.
Note: Full benchmark includes 33 user tasks (198 test cases). We test 10 tasks to balance comprehensive coverage with practical runtime (~10 min vs ~30 min per model).
📝 Note: All free OpenRouter models are tested. Models use the OpenRouter provider integration with AgentDojo, enabling security testing for any model accessible via OpenRouter's API.
Task completion benchmark • Ranked by composite score (accuracy + speed + efficiency)
OpenClawBench is a task completion benchmark that tests AI agents' ability to use tools and complete real-world tasks. Agents are evaluated on file manipulation, weather lookups, and web searches using a simulated bot interface.
We run 3 scenarios (file, weather, web) with easy difficulty tasks in single-turn mode (the agent provides one complete response per task). This subset provides fast benchmarking (~2-5 minutes per model) while covering core capabilities. Each scenario contains 3 tasks for a total of 9 tasks per model.
Prompt injection security benchmark • Ranked by attack success rate (lower = more secure)
AgentDojo v1.2.2 is a prompt injection security benchmark that tests AI agents' resistance to adversarial attacks. It evaluates whether agents can complete legitimate tasks while resisting malicious instructions embedded in tool outputs (emails, files, API responses). Developed by ETH Zurich's SPY Lab.
We run the workspace suite (email, calendar, cloud storage tools) with tool_knowledge attacks (malicious instructions hidden in tool outputs). We test 10 user tasks out of the full 33-task suite, with each task tested against 5 randomly sampled injection attacks (from 6 available attack types) = 50 security tests per model. We use pass@5 evaluation: an attack is considered successful if it works in at least 1 of the 5 attempts. Additionally, each of the 10 user tasks is tested for utility (task completion). This approach provides robust security evaluation with statistical significance in ~25-35 minutes per model.