153 Tasks · 17 Themes · 102 D4-D5 Tasks · 36% Hardest Domain Avg

Leaderboard

6 models, 153 tasks, 920 total runs across 17 compliance themes. D4-D5 tasks (67%) use F1 scoring; D1-D3 tasks use detection scoring.

Rank  Model              Provider   Avg F1  Highest  Lowest
1     Claude Sonnet 4.6  Anthropic  70%     100%     7%
2     GPT-5.5            OpenAI     64%     100%     0%
3     Claude Opus 4.7    Anthropic  63%     100%     17%
4     Claude Haiku 4.5   Anthropic  59%     100%     0%
5     GPT-4.1            OpenAI     57%     100%     0%
6     GPT-4o             OpenAI     40%     100%     0%

Per-Task Breakdown

All 20 tasks, grouped by difficulty. A dash (—) indicates the model was not evaluated on that task.

Task         D   Control          GPT-5.5  Sonnet  Opus  GPT-4.1  Haiku  GPT-4o  Avg
cc6.1-1-002  D1  Logical Access                                                  100%
cc8.1-1-001  D1  Change Mgmt                                                     55%
cc7.2-2-001  D2  Monitoring
cc6.1-3-001  D3  Logical Access   100%     100%    100%  89%      100%   78%     94%
cc6.3-3-001  D3  Data Access
cc8.1-4-001  D4  Change Mgmt      100%     92%     80%   92%      92%    100%    93%
a1.2-4-001   D4  Backup/Recovery  100%     100%    62%   80%      32%    60%     72%
cc3.1-4-001  D4  Risk Assessment  86%      73%     62%   80%      67%    44%     69%
cc9.1-4-001  D4  Vendor Mgmt      100%     73%     67%   67%      80%    18%     67%
cc6.1-4-001  D4  Logical Access   86%      67%     50%   —        —      —       67%
cc6.6-4-001  D4  Sys Boundaries   100%     100%    57%   55%      25%    44%     64%
cc6.3-4-001  D4  Data Access      60%      67%     57%   55%      67%    40%     58%
cc7.2-4-001  D4  Monitoring       86%      89%     67%   40%      42%    20%     57%
cc7.2-5-001  D5  Monitoring       100%     67%     91%   91%      91%    43%     80%
cc8.1-5-001  D5  Change Mgmt      73%      83%     91%   71%      43%    80%     74%
a1.2-5-001   D5  Backup/Recovery  80%      80%     30%   —        —      —       63%
cc6.1-5-001  D5  Logical Access   89%      83%     67%   62%      30%    31%     60%
cc6.6-5-001  D5  Sys Boundaries   18%      55%     33%   20%      27%    20%     29%
cc9.1-5-001  D5  Vendor Mgmt      33%      36%     40%   20%      14%    20%     27%
cc3.1-5-001  D5  Risk Assessment  57%      55%     40%   0%       12%    0%      27%

All Tasks

20 tasks across 8 controls. Click any task for full details — planted findings, evidence files, and per-model scoring breakdown.

D1-D2 — Detection
D3 — Cross-Reference
D4 — Red Herring Filtering
D5 — Materiality Judgment

Methodology

How TrustBench evaluates models.

Task Structure

Each task presents a model with synthetic audit evidence (policies, configs, logs, exception registers) and a SOC 2 control to assess. Evidence contains planted compliance gaps. The model produces an audit assessment. Scoring compares the model's findings against ground truth.
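
For illustration, a task might bundle its evidence and ground truth roughly as follows. This is a hypothetical sketch in Python; the file names and fields are invented and are not the actual TrustBench schema.

# Hypothetical task bundle (illustrative only; not the real schema)
task = {
    "id": "cc8.1-4-001",
    "control": "CC8.1 - Change Management",
    "difficulty": "D4",
    "evidence": [
        "change_management_policy.md",   # policy
        "deploy_pipeline_config.yaml",   # config
        "change_ticket_log.csv",         # activity log
        "exception_register.csv",        # approved exceptions (source of red herrings)
    ],
    "planted_findings": [
        "production deploys can bypass peer review",
        "emergency changes lack post-hoc approval",
    ],
}

The model sees only the control and the evidence files; the planted findings are the ground truth the scorer compares its assessment against.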

Difficulty Levels

D1-D2: Single document, gap detection. Calibration tasks.
D3: Multi-document cross-referencing. Policy contradicts config.
D4: Red herrings (gaps with valid exceptions), plus noise documents. Tests precision; see the example after this list.
D5: Materiality judgment. Ambiguous scenarios where auditors disagree. No single right answer.
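
To make the D4 distinction concrete, here is a hypothetical pair of evidence entries (invented for illustration): the first gap is a red herring because the exception register covers it, while the second is a genuine finding.

# Hypothetical D4 evidence excerpt (illustrative only)
exception_register = [
    {"system": "billing-db", "gap": "MFA not enforced",
     "approved_by": "CISO", "expires": "2025-12-31"},     # documented, time-boxed exception
]

candidate_gaps = [
    {"system": "billing-db", "gap": "MFA not enforced"},  # red herring: covered by an approved exception
    {"system": "hr-portal", "gap": "MFA not enforced"},   # genuine finding: no exception on record
]

A model that reports the billing-db gap anyway takes a precision hit under D4-D5 scoring.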

Scoring

D1-D2: Detection only (recall).
D3: Detection + precision penalty for over-reporting.
D4-D5: F1 score. Red herrings flagged as findings count as false positives, so models that over-report are penalized. Matching is keyword-based and deterministic, with no LLM-as-judge; a sketch follows below.
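
Below is a minimal sketch of what keyword-based F1 scoring can look like. The matching rule and threshold are assumptions for illustration, not the benchmark's exact scorer.

# Assumed keyword-overlap scorer (illustrative sketch)
def matches(finding: str, truth: str, threshold: float = 0.5) -> bool:
    # A finding counts as a hit if it contains enough of the ground-truth keywords.
    truth_words = set(truth.lower().split())
    hits = sum(1 for w in truth_words if w in finding.lower())
    return hits / len(truth_words) >= threshold

def f1_score(model_findings: list[str], ground_truth: list[str]) -> float:
    tp = sum(any(matches(f, t) for t in ground_truth) for f in model_findings)
    fp = len(model_findings) - tp    # unmatched findings, including flagged red herrings
    fn = sum(not any(matches(f, t) for f in model_findings) for t in ground_truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

Under this kind of scoring, a model that recovers all five planted gaps but also reports five unmatched extras lands at precision 0.5, recall 1.0, and an F1 of roughly 0.67, which is the over-reporting penalty described in the caveat below.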

Scoring Caveat: Over-Reporting Penalty

F1 scoring penalizes models that report more findings than the planted ground truth. In a real audit, extra findings might be legitimate — an observation a human auditor would include. But keyword-based scoring cannot distinguish "useful extra finding" from "noise." This favors concise models (Sonnet: 7.4 avg findings/task) over thorough ones (Opus: 9.6, GPT-5.5: 8.5). Sonnet's lead is partly driven by precision, not just recall. An LLM-as-judge secondary scorer is planned for v2 to address this.

Models Evaluated

Evaluated on all 13 D4-D5 tasks: GPT-5.5 (OpenAI), Claude Opus 4.7 (Anthropic), Claude Sonnet 4.6 (Anthropic).
Evaluated on D3 only: Claude Haiku 4.5, Claude Opus 4.6, GPT-4.1, GPT-4o, GPT-4o-mini, and o3.
Partial D4-D5 coverage: GPT-4o on 3 of the 13 tasks (excluded from the average).

Quick Start

Clone the repo and run locally.

git clone https://github.com/varungurnaney/trustbench.git
cd trustbench
pip install -r requirements.txt

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# or: export OPENAI_API_KEY=sk-...

# Run a single task
python3 -m trustbench.cli run tasks/cc8.1-4-001 --model claude-sonnet-4-6

# Run all D4-D5 tasks
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 4
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 5

# See results
python3 -m trustbench.cli leaderboard

See the authoring guide to contribute tasks.