We introduce TrustBench, a benchmark for evaluating large language models on security compliance auditing. TrustBench presents models with synthetic audit evidence — policies, configurations, access logs, termination records, and exception registers — containing planted compliance gaps, and measures detection accuracy against ground truth. The benchmark includes 20 tasks across 8 SOC 2 controls at five difficulty levels, from single-document gap detection (D1) to materiality judgment under ambiguity (D5). D4-D5 tasks, which comprise 75% of the benchmark, test red herring filtering, noise document handling, and professional judgment. We evaluate 6 models from Anthropic and OpenAI. Claude Sonnet 4.6 leads at 82% average score, followed by Claude Opus 4.7 (72%) and GPT-5.5 (71%). Three D5 tasks average below 28% across all models, indicating that compliance judgment remains beyond current LLM capabilities. Scoring uses deterministic keyword matching with F1 on D4-D5 tasks; we document a systematic limitation where F1 penalizes thorough models that report legitimate extra findings. The benchmark schema is framework-agnostic and supports ISO 27001, HIPAA, PCI-DSS, and any control-based standard. Code, tasks, and results are publicly available.
Security compliance auditing is a labor-intensive process. A typical SOC 2 Type II audit involves reviewing hundreds of documents — access control policies, IAM configurations, change management logs, backup records — to determine whether an organization's controls are operating effectively. This is exactly the kind of document-heavy, cross-referencing, judgment-intensive work that large language models should excel at.
GRC (governance, risk, and compliance) teams are already using LLMs for evidence review, gap analysis, control mapping, and questionnaire responses. Yet no benchmark exists to evaluate how well models perform these tasks. The closest prior work is AIReg-Bench (Sherborne et al., 2025), which tests LLMs against 120 EU AI Act compliance samples — a single regulation, a single document type, at proof-of-concept scale.
Other benchmarks evaluate adjacent capabilities. SWE-bench (Jimenez et al., 2024) tests software engineering through real GitHub issues. CyBench (Zhang et al., 2024) and CyberGym (Wang et al., 2025) evaluate cybersecurity through CTF challenges and vulnerability analysis. GAIA (Mialon et al., 2023) tests general reasoning with tools. None test compliance auditing — the multi-document cross-referencing and materiality judgment that compliance teams perform daily.
We introduce TrustBench, a benchmark designed to fill this gap. TrustBench evaluates whether models can: (1) detect compliance gaps in policy documents, (2) cross-reference policies against technical configurations and operational logs, (3) distinguish real gaps from documented exceptions, and (4) make materiality judgments on ambiguous findings.
Each TrustBench task simulates an audit scenario. The model receives a compliance control to assess, an evidence package (1-8 documents), and an instruction to produce an audit report. Evidence documents include policies (.md), IAM configurations (.json), MFA policies (.json), access logs (.csv), termination records (.csv), exception registers (.csv), and screenshots (.png). Hidden in the evidence are planted compliance gaps. The model's job is to find them.
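A minimal sketch of what a single task might look like, written as a Python dict; the field names (`framework`, `control`, `evidence`, `findings`, `detection_keywords`, and so on) are assumptions for illustration, not the published schema.

```python
# Hypothetical TrustBench task definition. Field names are illustrative,
# not the released schema. Each task bundles the control to assess, an
# evidence package, and the planted ground-truth findings used for scoring.
task = {
    "task_id": "cc6_1_offboarding",
    "difficulty": "D4",
    "framework": "SOC 2",      # free-form string: "ISO 27001", "HIPAA", ...
    "control": "CC6.1",
    "prompt": (
        "Assess CC6.1 (Logical Access) against the attached evidence and "
        "produce an audit report listing any compliance gaps."
    ),
    "evidence": [
        {"path": "access_control_policy.md", "type": "policy"},
        {"path": "iam_config.json",          "type": "configuration"},
        {"path": "termination_records.csv",  "type": "operational_log"},
        {"path": "exception_register.csv",   "type": "exception_register"},
    ],
    "findings": [
        {
            "id": "terminated-user-still-active",
            "type": "gap",
            # credit requires at least `min_matches` of these in the report
            "detection_keywords": ["terminated", "still active", "not revoked"],
            "min_matches": 2,
        },
        {
            "id": "break-glass-admin-account",
            "type": "red_herring",
            # flagging this is a false positive unless the report also uses
            # an exclusion keyword acknowledging the documented exception
            "detection_keywords": ["break-glass", "shared admin"],
            "exclusion_keywords": ["exception register", "approved exception"],
        },
    ],
}
```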
Tasks span five difficulty levels, each testing a distinct audit skill, ranging from single-document gap detection (D1) to materiality judgment under ambiguity (D5).
D4-D5 tasks comprise 75% of the benchmark. D1-D3 tasks serve as calibration.
The initial task set covers 8 SOC 2 Trust Service Criteria controls: CC6.1 (Logical Access), CC6.3 (Access Authorization), CC6.6 (System Boundaries), CC7.2 (Monitoring), CC8.1 (Change Management), CC3.1 (Risk Assessment), CC9.1 (Vendor Management), and A1.2 (Backup & Recovery). The schema is framework-agnostic — the framework and control fields accept any string.
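Concretely, because the framework and control fields are plain strings, controls from other standards can reuse the same task structure; the identifiers below are illustrative examples, not shipped tasks.

```python
# Framework-agnostic by construction: "framework" and "control" are
# free-form strings, so any control-based standard fits the same schema.
# These identifiers are illustrative examples, not shipped task content.
controls = [
    {"framework": "SOC 2",     "control": "CC6.1"},
    {"framework": "ISO 27001", "control": "A.5.18"},
    {"framework": "HIPAA",     "control": "164.312(a)(1)"},
    {"framework": "PCI-DSS",   "control": "8.2.1"},
]
```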
Scoring is deterministic and keyword-based. Each planted finding has a set of detection keywords and a minimum match threshold. D1-D2 tasks use detection scoring (recall = gaps detected / total gaps). D3 tasks use detection with a precision penalty. D4-D5 tasks use F1 — the harmonic mean of recall (gaps found / total gaps) and precision (true positives / total model findings). Findings of type red_herring count as false positives if flagged without exclusion keywords.
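A compact sketch of the scoring logic follows; function and field names are assumptions for illustration, the D3 precision penalty is omitted, and the red-herring exclusion check is simplified relative to whatever the released scorer does.

```python
# Illustrative sketch of the scoring described above, not the released
# scorer. D1-D2: recall only; D4-D5: F1. The D3 precision penalty and the
# red-herring exclusion-keyword check are simplified away here.

def detected(finding: dict, report: str) -> bool:
    """A planted finding counts as detected when at least `min_matches`
    of its detection keywords appear in the report (case-insensitive)."""
    text = report.lower()
    hits = sum(kw.lower() in text for kw in finding["detection_keywords"])
    return hits >= finding.get("min_matches", 1)

def score_task(planted: list[dict], report: str, n_model_findings: int,
               difficulty: str) -> float:
    """`n_model_findings` is the number of distinct findings parsed from the
    model's free-text report. Anything that is not a detected planted gap,
    including a red herring flagged without its exclusion keywords, lowers
    precision."""
    gaps = [f for f in planted if f["type"] == "gap"]
    true_pos = sum(detected(f, report) for f in gaps)
    recall = true_pos / len(gaps) if gaps else 0.0
    if difficulty in ("D1", "D2"):
        return recall
    precision = true_pos / n_model_findings if n_model_findings else 0.0
    return (2 * recall * precision / (recall + precision)
            if recall + precision else 0.0)
```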
We evaluate 6 models from Anthropic and OpenAI on all 20 tasks (120 total runs).
Each model receives the same prompt and evidence. Output is free-text — compliance assessments are narrative, and we do not constrain the output format. Temperature is set to default for all models. Each task is run once per model.
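A harness for this protocol can be small; the sketch below assumes the hypothetical task dict from above, and `run_model` is a placeholder for the provider SDK call rather than an actual TrustBench function.

```python
# Illustrative harness for the protocol above: identical prompt and evidence
# per model, free-text output, default sampling, one run per (model, task).
# `run_model` is a placeholder for the provider SDK call, and text evidence
# is inlined; screenshot (.png) evidence would use the provider's image
# input path and is omitted from this sketch.
from pathlib import Path

MODELS = ["model-a", "model-b"]  # six models from two providers in practice

def build_prompt(task: dict, evidence_dir: Path) -> str:
    parts = [task["prompt"]]
    for doc in task["evidence"]:
        if doc["path"].endswith(".png"):
            continue  # image evidence handled separately
        parts.append(f"--- {doc['path']} ---")
        parts.append((evidence_dir / doc["path"]).read_text())
    return "\n\n".join(parts)

def run_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("call the provider SDK here")

def evaluate(tasks: list[dict], evidence_root: Path) -> dict:
    reports = {}
    for task in tasks:
        prompt = build_prompt(task, evidence_root / task["task_id"])
        for model_id in MODELS:
            reports[(model_id, task["task_id"])] = run_model(model_id, prompt)
    return reports
```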
| Rank | Model | Provider | Avg Score | Avg Recall | Avg Precision | Avg Findings/Task |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 82% | 92% | 63% | 7.3 |
| 2 | Claude Opus 4.7 | Anthropic | 72% | 94% | 47% | 9.6 |
| 3 | GPT-5.5 | OpenAI | 71% | 82% | 50% | 8.7 |
| 4 | GPT-4.1 | OpenAI | 61% | 71% | 48% | 6.8 |
| 5 | Claude Haiku 4.5 | Anthropic | 61% | 90% | 39% | 14.6 |
| 6 | GPT-4o | OpenAI | 44% | 55% | 38% | 7.2 |
The hardest tasks in the benchmark are D5 judgment tasks where reasonable auditors would disagree; three of these tasks average below 28% across all models.
Most frontier models achieve high recall: they find the planted gaps. Where scores diverge is precision, that is, how many extra findings the model reports. Haiku averages 14.6 findings per task (the highest) but has the lowest precision (39%). Sonnet reports 7.3 findings per task and achieves 63% precision. This suggests that what separates models on compliance auditing is less the ability to find gaps than the discipline to avoid over-reporting.
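To see why precision drives the spread, compare the harmonic mean at the two operating points below; this is only an illustration of how F1 behaves, since the headline scores average per-task results (and include non-F1 D1-D3 tasks) rather than taking F1 of the table's averaged columns.

```python
# F1 is the harmonic mean of recall and precision, so it is pulled toward
# the smaller of the two. Illustrative only: the benchmark averages
# per-task scores; it does not compute F1 of the averaged columns.
def f1(recall: float, precision: float) -> float:
    return 2 * recall * precision / (recall + precision)

print(round(f1(0.92, 0.63), 2))  # Sonnet-like operating point -> 0.75
print(round(f1(0.90, 0.39), 2))  # Haiku-like operating point  -> 0.54
```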
F1 scoring penalizes models that report findings beyond the planted ground truth. This is a deliberate design choice: in production, false positives cost analyst time. However, some extra findings may be legitimate compliance observations. Opus reports 9.6 findings per task — the extras often include valid observations like "emergency approval via Slack DM lacks audit trail" or "standard change deployed outside maintenance window." These are penalized as false positives under keyword scoring.
This means the current scoring favors concise models over thorough ones, and Sonnet's lead over Opus (82% vs 72%) is partly driven by this effect. However, Sonnet also achieves genuinely higher recall on D4-D5 tasks, reaching 100% recall on 16 of the 20 tasks versus fewer for Opus, so the lead is not purely a precision artifact.
No model dominates across all controls. GPT-4o scores 93% on Change Management but 19% on Vendor Management. Sonnet scores 96% on Logical Access but 55% on Vendor Management. These domain-specific performance gaps suggest that compliance capability is not a single dimension — models have control-specific strengths and weaknesses.
Code benchmarks. SWE-bench (Jimenez et al., 2024) evaluates LLMs on real GitHub issues across 12 Python repositories. SWE-bench Pro (Scale AI, 2025) extends this to 1,865 tasks across 41 repos in multiple languages. Both use test-suite evaluation against gold-standard patches.
Cybersecurity benchmarks. CyBench (Zhang et al., 2024) evaluates agents on 40 CTF challenges with subtask decomposition. CyberGym (Wang et al., 2025) uses 1,507 real CVEs for proof-of-concept exploit generation. Both use deterministic evaluation (flag matching, crash/no-crash).
Compliance evaluation. AIReg-Bench (Sherborne et al., 2025) tests compliance assessment against EU AI Act articles with 120 expert-annotated samples. LegalBench (Guha et al., 2023) tests legal reasoning across 162 tasks. CUAD (Hendrycks et al., 2021) tests contract clause extraction. None test multi-document compliance auditing with exception handling and materiality judgment.
General reasoning. GAIA (Mialon et al., 2023) tests multi-step reasoning with tools. GPQA (Rein et al., 2023) tests graduate-level domain knowledge. Both use exact-match evaluation.
TrustBench is distinguished by its focus on compliance-specific skills: cross-referencing multiple evidence types, validating exceptions, filtering noise, and making materiality judgments.
TrustBench demonstrates that LLMs can detect compliance gaps that require cross-referencing multiple documents, a capability with direct practical value for GRC teams. However, three findings temper this optimism: (1) the hardest D5 materiality judgment tasks average below 28% across all models, (2) models systematically over-report, creating false positives that would waste analyst time, and (3) performance varies significantly by control domain, with no single model dominating across all areas.
The benchmark is early-stage (20 tasks, one framework) and the scoring methodology has documented limitations around the over-reporting penalty. We plan to expand to 64+ tasks, add ISO 27001 and HIPAA task sets, and introduce an LLM-as-judge secondary scorer for v2.
Code, tasks, and results are available at github.com/varungurnaney/trustbench.
Acknowledgments. The benchmark design was informed by the architectural patterns established by SWE-bench, CyBench, CyberGym, and GAIA.