153 Tasks · 17 Themes · 102 D4-D5 Tasks · 36% Hardest Domain Avg

Leaderboard

6 models, 153 tasks, 920 total runs across 17 compliance themes. D4-D5 tasks (67%) use F1 scoring; D1-D3 tasks use detection scoring.

Rank  Model              Provider   Avg F1  Highest  Lowest
1     Claude Sonnet 4.6  Anthropic  70%     100%     7%
2     GPT-5.5            OpenAI     64%     100%     0%
3     Claude Opus 4.7    Anthropic  63%     100%     17%
4     Claude Haiku 4.5   Anthropic  59%     100%     0%
5     GPT-4.1            OpenAI     57%     100%     0%
6     GPT-4o             OpenAI     40%     100%     0%

Per-Task Breakdown

All 20 tasks, grouped by difficulty. A dash (—) indicates the model was not evaluated on that task.

Task         D   Control          GPT-5.5  Sonnet  Opus  GPT-4.1  Haiku  GPT-4o  Avg
cc6.1-1-002  D1  Logical Access                                                  100%
cc8.1-1-001  D1  Change Mgmt                                                     55%
cc7.2-2-001  D2  Monitoring
cc6.1-3-001  D3  Logical Access   100%     100%    100%  89%      100%   78%     94%
cc6.3-3-001  D3  Data Access
cc8.1-4-001  D4  Change Mgmt      100%     92%     80%   92%      92%    100%    93%
a1.2-4-001   D4  Backup/Recovery  100%     100%    62%   80%      32%    60%     72%
cc3.1-4-001  D4  Risk Assessment  86%      73%     62%   80%      67%    44%     69%
cc9.1-4-001  D4  Vendor Mgmt      100%     73%     67%   67%      80%    18%     67%
cc6.1-4-001  D4  Logical Access   86%      67%     50%   —        —      —       67%
cc6.6-4-001  D4  Sys Boundaries   100%     100%    57%   55%      25%    44%     64%
cc6.3-4-001  D4  Data Access      60%      67%     57%   55%      67%    40%     58%
cc7.2-4-001  D4  Monitoring       86%      89%     67%   40%      42%    20%     57%
cc7.2-5-001  D5  Monitoring       100%     67%     91%   91%      91%    43%     80%
cc8.1-5-001  D5  Change Mgmt      73%      83%     91%   71%      43%    80%     74%
a1.2-5-001   D5  Backup/Recovery  80%      80%     30%   —        —      —       63%
cc6.1-5-001  D5  Logical Access   89%      83%     67%   62%      30%    31%     60%
cc6.6-5-001  D5  Sys Boundaries   18%      55%     33%   20%      27%    20%     29%
cc9.1-5-001  D5  Vendor Mgmt      33%      36%     40%   20%      14%    20%     27%
cc3.1-5-001  D5  Risk Assessment  57%      55%     40%   0%       12%    0%      27%

All Tasks

20 tasks across 8 controls. Click any task for full details — planted findings, evidence files, and per-model scoring breakdown.

D1-D2 — Detection
D3 — Cross-Reference
D4 — Red Herring Filtering
D5 — Materiality Judgment

Methodology

How TrustBench evaluates models.

Task Structure

Each task presents a model with synthetic audit evidence (policies, configs, logs, exception registers) and a SOC 2 control to assess. Evidence contains planted compliance gaps. The model produces an audit assessment. Scoring compares the model's findings against ground truth.
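
For illustration, a task might bundle its evidence and ground truth roughly as follows. This is a hypothetical sketch in Python; the file names and fields are invented and are not the actual TrustBench schema.

# Hypothetical task bundle (illustrative only; not the real schema)
task = {
    "id": "cc8.1-4-001",
    "control": "CC8.1 - Change Management",
    "difficulty": "D4",
    "evidence": [
        "change_management_policy.md",   # policy
        "deploy_pipeline_config.yaml",   # config
        "change_ticket_log.csv",         # activity log
        "exception_register.csv",        # approved exceptions (source of red herrings)
    ],
    "planted_findings": [
        "production deploys can bypass peer review",
        "emergency changes lack post-hoc approval",
    ],
}

The model sees only the control and the evidence files; the planted findings are the ground truth the scorer compares its assessment against.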

Difficulty Levels

D1-D2: Single document, gap detection. Calibration tasks.
D3: Multi-document cross-referencing. Policy contradicts config.
D4: Red herrings (gaps with valid exceptions), plus noise documents. Tests precision; see the example after this list.
D5: Materiality judgment. Ambiguous scenarios where auditors disagree. No single right answer.
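
To make the D4 distinction concrete, here is a hypothetical pair of evidence entries (invented for illustration): the first gap is a red herring because the exception register covers it, while the second is a genuine finding.

# Hypothetical D4 evidence excerpt (illustrative only)
exception_register = [
    {"system": "billing-db", "gap": "MFA not enforced",
     "approved_by": "CISO", "expires": "2025-12-31"},     # documented, time-boxed exception
]

candidate_gaps = [
    {"system": "billing-db", "gap": "MFA not enforced"},  # red herring: covered by an approved exception
    {"system": "hr-portal", "gap": "MFA not enforced"},   # genuine finding: no exception on record
]

A model that reports the billing-db gap anyway takes a precision hit under D4-D5 scoring.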

Scoring

D1-D2: Detection only (recall).
D3: Detection + precision penalty for over-reporting.
D4-D5: F1 score. Red herrings flagged as findings count as false positives, so models that over-report are penalized. Matching is keyword-based and deterministic, with no LLM-as-judge; a sketch follows below.
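
Below is a minimal sketch of what keyword-based F1 scoring can look like. The matching rule and threshold are assumptions for illustration, not the benchmark's exact scorer.

# Assumed keyword-overlap scorer (illustrative sketch)
def matches(finding: str, truth: str, threshold: float = 0.5) -> bool:
    # A finding counts as a hit if it contains enough of the ground-truth keywords.
    truth_words = set(truth.lower().split())
    hits = sum(1 for w in truth_words if w in finding.lower())
    return hits / len(truth_words) >= threshold

def f1_score(model_findings: list[str], ground_truth: list[str]) -> float:
    tp = sum(any(matches(f, t) for t in ground_truth) for f in model_findings)
    fp = len(model_findings) - tp    # unmatched findings, including flagged red herrings
    fn = sum(not any(matches(f, t) for f in model_findings) for t in ground_truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

Under this kind of scoring, a model that recovers all five planted gaps but also reports five unmatched extras lands at precision 0.5, recall 1.0, and an F1 of roughly 0.67, which is the over-reporting penalty described in the caveat below.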

Scoring Caveat: Over-Reporting Penalty

F1 scoring penalizes models that report more findings than the planted ground truth. In a real audit, extra findings might be legitimate — an observation a human auditor would include. But keyword-based scoring cannot distinguish "useful extra finding" from "noise." This favors concise models (Sonnet: 7.4 avg findings/task) over thorough ones (Opus: 9.6, GPT-5.5: 8.5). Sonnet's lead is partly driven by precision, not just recall. An LLM-as-judge secondary scorer is planned for v2 to address this.

Models Evaluated

Evaluated on all 13 D4-D5 tasks: GPT-5.5 (OpenAI), Claude Opus 4.7 (Anthropic), Claude Sonnet 4.6 (Anthropic).
Evaluated on D3 only: Claude Haiku 4.5, Claude Opus 4.6, GPT-4.1, GPT-4o, GPT-4o-mini, and o3.
Partial D4-D5 coverage: GPT-4o on 3 of the 13 tasks (excluded from the average).

Quick Start

Clone the repo and run locally.

git clone https://github.com/varungurnaney/trustbench.git
cd trustbench
pip install -r requirements.txt

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# or: export OPENAI_API_KEY=sk-...

# Run a single task
python3 -m trustbench.cli run tasks/cc8.1-4-001 --model claude-sonnet-4-6

# Run all D4-D5 tasks
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 4
python3 -m trustbench.cli run-all --model claude-sonnet-4-6 --difficulty 5

# See results
python3 -m trustbench.cli leaderboard

See the authoring guide to contribute tasks.