An open benchmark for measuring how well LLMs perform security compliance auditing: 20 tasks across 8 SOC 2 controls, scored with F1 against planted findings, with red herrings, noise documents, and judgment calls mixed in.
6 models evaluated, 153 scored tasks, 920 total runs, 17 compliance themes. D4–D5 tasks (67% of the suite) use F1 scoring; D1–D3 tasks use detection scoring.
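TrustBench's actual scorer lives in the repo; as a rough sketch (the function name and finding IDs below are illustrative, not the repo's API), F1 over finding IDs works like this: red herrings a model reports count against precision, and planted findings it misses count against recall.

```python
def f1_score(expected: set, reported: set) -> float:
    """F1 over finding IDs. `expected` holds the planted findings;
    anything extra in `reported` (e.g. a red herring) hurts precision,
    anything missing hurts recall."""
    if not expected and not reported:
        return 1.0  # nothing planted, nothing reported
    true_pos = len(expected & reported)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(reported)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)

# Example: 3 planted findings; model reports 2 of them plus 1 red herring.
planted = {"F1", "F2", "F3"}
model_report = {"F1", "F2", "RH1"}
print(round(f1_score(planted, model_report), 2))  # precision 2/3, recall 2/3 → 0.67
```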
| Rank | Model | Provider | Avg F1 | Highest | Lowest |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 70% | 100% | 7% |
| 2 | GPT-5.5 | OpenAI | 64% | 100% | 0% |
| 3 | Claude Opus 4.7 | Anthropic | 63% | 100% | 17% |
| 4 | Claude Haiku 4.5 | Anthropic | 59% | 100% | 0% |
| 5 | GPT-4.1 | OpenAI | 57% | 100% | 0% |
| 6 | GPT-4o | OpenAI | 40% | 100% | 0% |
All 20 tasks, grouped by difficulty. A dash (—) means the model was not evaluated on that task.
| Task | D | Control | GPT-5.5 | Sonnet | Opus | GPT-4.1 | Haiku | GPT-4o | Avg |
|---|---|---|---|---|---|---|---|---|---|
| cc6.1-1-002 | D1 | Logical Access | — | — | 100% | — | — | — | — |
| cc8.1-1-001 | D1 | Change Mgmt | — | 55% | — | — | — | — | — |
| cc7.2-2-001 | D2 | Monitoring | — | — | — | — | — | — | — |
| cc6.1-3-001 | D3 | Logical Access | 100% | 100% | 100% | 89% | 100% | 78% | 94% |
| cc6.3-3-001 | D3 | Data Access | — | — | — | — | — | — | — |
| cc8.1-4-001 | D4 | Change Mgmt | 100% | 92% | 80% | 92% | 92% | 100% | 93% |
| a1.2-4-001 | D4 | Backup/Recovery | 100% | 100% | 62% | 80% | 32% | 60% | 72% |
| cc3.1-4-001 | D4 | Risk Assessment | 86% | 73% | 62% | 80% | 67% | 44% | 69% |
| cc9.1-4-001 | D4 | Vendor Mgmt | 100% | 73% | 67% | 67% | 80% | 18% | 67% |
| cc6.1-4-001 | D4 | Logical Access | — | — | — | 86% | 67% | 50% | 67% |
| cc6.6-4-001 | D4 | Sys Boundaries | 100% | 100% | 57% | 55% | 25% | 44% | 64% |
| cc6.3-4-001 | D4 | Data Access | 60% | 67% | 57% | 55% | 67% | 40% | 58% |
| cc7.2-4-001 | D4 | Monitoring | 86% | 89% | 67% | 40% | 42% | 20% | 57% |
| cc7.2-5-001 | D5 | Monitoring | 100% | 67% | 91% | 91% | 91% | 43% | 80% |
| cc8.1-5-001 | D5 | Change Mgmt | 73% | 83% | 91% | 71% | 43% | 80% | 74% |
| a1.2-5-001 | D5 | Backup/Recovery | — | — | — | 80% | 80% | 30% | 63% |
| cc6.1-5-001 | D5 | Logical Access | 89% | 83% | 67% | 62% | 30% | 31% | 60% |
| cc6.6-5-001 | D5 | Sys Boundaries | 18% | 55% | 33% | 20% | 27% | 20% | 29% |
| cc9.1-5-001 | D5 | Vendor Mgmt | 33% | 36% | 40% | 20% | 14% | 20% | 27% |
| cc3.1-5-001 | D5 | Risk Assessment | 57% | 55% | 40% | 0% | 12% | 0% | 27% |
Each task ships with its planted findings, evidence files, and a per-model scoring breakdown.
How TrustBench evaluates models.
Clone the repo and run locally:

```bash
git clone https://github.com/varungurnaney/trustbench.git
cd trustbench
pip install -r requirements.txt

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# or: export OPENAI_API_KEY=sk-...

# Run a single task
python3 -m trustbench.cli run tasks/cc8.1-4-001 --model claude-sonnet-4-6

# Run all D4-D5 tasks
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 4
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 5

# See results
python3 -m trustbench.cli leaderboard
```
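The per-difficulty `run-all` commands above can be scripted. A minimal Python driver might look like the sketch below — the model name and CLI flags are copied from the quickstart, and whether your checkout accepts other model strings is an assumption to verify against the repo.

```python
# Hypothetical batch driver for the trustbench CLI commands shown above.
import subprocess

def build_commands(models, difficulties):
    """Build one `run-all` invocation per (model, difficulty) pair."""
    return [
        ["python3", "-m", "trustbench.cli", "run-all",
         "--model", m, "--difficulty", str(d)]
        for m in models
        for d in difficulties
    ]

def run_batch(models=("claude-sonnet-4-6",), difficulties=(4, 5)):
    """Run each invocation sequentially; needs the repo plus an API key."""
    for cmd in build_commands(models, difficulties):
        subprocess.run(cmd, check=True)

# run_batch()  # uncomment once trustbench is installed locally
```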
See the authoring guide to contribute tasks.