We introduce TrustBench, a benchmark for evaluating large language models on security compliance auditing. TrustBench presents models with synthetic audit evidence — policies, configurations, access logs, termination records, and exception registers — containing planted compliance gaps, and measures detection accuracy against ground truth. The benchmark includes 20 tasks across 8 SOC 2 controls at five difficulty levels, from single-document gap detection (D1) to materiality judgment under ambiguity (D5). D4-D5 tasks, which comprise 75% of the benchmark, test red herring filtering, noise document handling, and professional judgment. We evaluate 6 models from Anthropic and OpenAI. Claude Sonnet 4.6 leads at 82% average score, followed by Claude Opus 4.7 (72%) and GPT-5.5 (71%). Three D5 tasks average below 28% across all models, indicating that compliance judgment remains beyond current LLM capabilities. Scoring uses deterministic keyword matching with F1 on D4-D5 tasks; we document a systematic limitation where F1 penalizes thorough models that report legitimate extra findings. The benchmark schema is framework-agnostic and supports ISO 27001, HIPAA, PCI-DSS, and any control-based standard. Code, tasks, and results are publicly available.
Security compliance auditing is a labor-intensive process. A typical SOC 2 Type II audit involves reviewing hundreds of documents — access control policies, IAM configurations, change management logs, backup records — to determine whether an organization's controls are operating effectively. This is exactly the kind of document-heavy, cross-referencing, judgment-intensive work that large language models should excel at.
GRC (governance, risk, and compliance) teams are already using LLMs for evidence review, gap analysis, control mapping, and questionnaire responses. Yet no benchmark exists to evaluate how well models perform these tasks. The closest prior work is AIReg-Bench (Sherborne et al., 2025), which tests LLMs against 120 EU AI Act compliance samples — a single regulation, a single document type, at proof-of-concept scale.
Other benchmarks evaluate adjacent capabilities. SWE-bench (Jimenez et al., 2024) tests software engineering through real GitHub issues. CyBench (Zhang et al., 2024) and CyberGym (Wang et al., 2025) evaluate cybersecurity through CTF challenges and vulnerability analysis. GAIA (Mialon et al., 2023) tests general reasoning with tools. None test compliance auditing — the multi-document cross-referencing and materiality judgment that compliance teams perform daily.
We introduce TrustBench, a benchmark designed to fill this gap. TrustBench evaluates whether models can: (1) detect compliance gaps in policy documents, (2) cross-reference policies against technical configurations and operational logs, (3) distinguish real gaps from documented exceptions, and (4) make materiality judgments on ambiguous findings.
Each TrustBench task simulates an audit scenario. The model receives a compliance control to assess, an evidence package (1-8 documents), and an instruction to produce an audit report. Evidence documents include policies (.md), IAM configurations (.json), MFA policies (.json), access logs (.csv), termination records (.csv), exception registers (.csv), and screenshots (.png). Hidden in the evidence are planted compliance gaps. The model's job is to find them.
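A minimal sketch of what a single task might look like, written as a Python dict; the field names (`framework`, `control`, `evidence`, `findings`, `detection_keywords`, and so on) are assumptions for illustration, not the published schema.

```python
# Hypothetical TrustBench task definition. Field names are illustrative,
# not the released schema. Each task bundles the control to assess, an
# evidence package, and the planted ground-truth findings used for scoring.
task = {
    "task_id": "cc6_1_offboarding",
    "difficulty": "D4",
    "framework": "SOC 2",      # free-form string: "ISO 27001", "HIPAA", ...
    "control": "CC6.1",
    "prompt": (
        "Assess CC6.1 (Logical Access) against the attached evidence and "
        "produce an audit report listing any compliance gaps."
    ),
    "evidence": [
        {"path": "access_control_policy.md", "type": "policy"},
        {"path": "iam_config.json",          "type": "configuration"},
        {"path": "termination_records.csv",  "type": "operational_log"},
        {"path": "exception_register.csv",   "type": "exception_register"},
    ],
    "findings": [
        {
            "id": "terminated-user-still-active",
            "type": "gap",
            # credit requires at least `min_matches` of these in the report
            "detection_keywords": ["terminated", "still active", "not revoked"],
            "min_matches": 2,
        },
        {
            "id": "break-glass-admin-account",
            "type": "red_herring",
            # flagging this is a false positive unless the report also uses
            # an exclusion keyword acknowledging the documented exception
            "detection_keywords": ["break-glass", "shared admin"],
            "exclusion_keywords": ["exception register", "approved exception"],
        },
    ],
}
```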
Tasks span five difficulty levels, each testing a distinct audit skill, ranging from single-document gap detection (D1) to materiality judgment under ambiguity (D5).
D4-D5 tasks comprise 75% of the benchmark. D1-D3 tasks serve as calibration.
The initial task set covers 8 SOC 2 Trust Service Criteria controls: CC6.1 (Logical Access), CC6.3 (Access Authorization), CC6.6 (System Boundaries), CC7.2 (Monitoring), CC8.1 (Change Management), CC3.1 (Risk Assessment), CC9.1 (Vendor Management), and A1.2 (Backup & Recovery). The schema is framework-agnostic — the framework and control fields accept any string.
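Concretely, because the framework and control fields are plain strings, controls from other standards can reuse the same task structure; the identifiers below are illustrative examples, not shipped tasks.

```python
# Framework-agnostic by construction: "framework" and "control" are
# free-form strings, so any control-based standard fits the same schema.
# These identifiers are illustrative examples, not shipped task content.
controls = [
    {"framework": "SOC 2",     "control": "CC6.1"},
    {"framework": "ISO 27001", "control": "A.5.18"},
    {"framework": "HIPAA",     "control": "164.312(a)(1)"},
    {"framework": "PCI-DSS",   "control": "8.2.1"},
]
```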
Scoring is deterministic and keyword-based. Each planted finding has a set of detection keywords and a minimum match threshold. D1-D2 tasks use detection scoring (recall = gaps detected / total gaps). D3 tasks use detection with a precision penalty. D4-D5 tasks use F1 — the harmonic mean of recall (gaps found / total gaps) and precision (true positives / total model findings). Findings of type red_herring count as false positives if flagged without exclusion keywords.
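A compact sketch of the scoring logic follows; function and field names are assumptions for illustration, the D3 precision penalty is omitted, and the red-herring exclusion check is simplified relative to whatever the released scorer does.

```python
# Illustrative sketch of the scoring described above, not the released
# scorer. D1-D2: recall only; D4-D5: F1. The D3 precision penalty and the
# red-herring exclusion-keyword check are simplified away here.

def detected(finding: dict, report: str) -> bool:
    """A planted finding counts as detected when at least `min_matches`
    of its detection keywords appear in the report (case-insensitive)."""
    text = report.lower()
    hits = sum(kw.lower() in text for kw in finding["detection_keywords"])
    return hits >= finding.get("min_matches", 1)

def score_task(planted: list[dict], report: str, n_model_findings: int,
               difficulty: str) -> float:
    """`n_model_findings` is the number of distinct findings parsed from the
    model's free-text report. Anything that is not a detected planted gap,
    including a red herring flagged without its exclusion keywords, lowers
    precision."""
    gaps = [f for f in planted if f["type"] == "gap"]
    true_pos = sum(detected(f, report) for f in gaps)
    recall = true_pos / len(gaps) if gaps else 0.0
    if difficulty in ("D1", "D2"):
        return recall
    precision = true_pos / n_model_findings if n_model_findings else 0.0
    return (2 * recall * precision / (recall + precision)
            if recall + precision else 0.0)
```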
We evaluate 6 models from Anthropic and OpenAI on all 20 tasks (120 total runs).
Each model receives the same prompt and evidence. Output is free-text — compliance assessments are narrative, and we do not constrain the output format. Temperature is set to default for all models. Each task is run once per model.
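A harness for this protocol can be small; the sketch below assumes the hypothetical task dict from above, and `run_model` is a placeholder for the provider SDK call rather than an actual TrustBench function.

```python
# Illustrative harness for the protocol above: identical prompt and evidence
# per model, free-text output, default sampling, one run per (model, task).
# `run_model` is a placeholder for the provider SDK call, and text evidence
# is inlined; screenshot (.png) evidence would use the provider's image
# input path and is omitted from this sketch.
from pathlib import Path

MODELS = ["model-a", "model-b"]  # six models from two providers in practice

def build_prompt(task: dict, evidence_dir: Path) -> str:
    parts = [task["prompt"]]
    for doc in task["evidence"]:
        if doc["path"].endswith(".png"):
            continue  # image evidence handled separately
        parts.append(f"--- {doc['path']} ---")
        parts.append((evidence_dir / doc["path"]).read_text())
    return "\n\n".join(parts)

def run_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("call the provider SDK here")

def evaluate(tasks: list[dict], evidence_root: Path) -> dict:
    reports = {}
    for task in tasks:
        prompt = build_prompt(task, evidence_root / task["task_id"])
        for model_id in MODELS:
            reports[(model_id, task["task_id"])] = run_model(model_id, prompt)
    return reports
```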
| Rank | Model | Provider | Avg Score | Avg Recall | Avg Precision | Avg Findings/Task |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 82% | 92% | 63% | 7.3 |
| 2 | Claude Opus 4.7 | Anthropic | 72% | 94% | 47% | 9.6 |
| 3 | GPT-5.5 | OpenAI | 71% | 82% | 50% | 8.7 |
| 4 | GPT-4.1 | OpenAI | 61% | 71% | 48% | 6.8 |
| 5 | Claude Haiku 4.5 | Anthropic | 61% | 90% | 39% | 14.6 |
| 6 | GPT-4o | OpenAI | 44% | 55% | 38% | 7.2 |
The hardest tasks in the benchmark are D5 judgment tasks where reasonable auditors would disagree; three of these tasks average below 28% across all models.
Most frontier models achieve high recall: they find the planted gaps. Where scores diverge is precision, that is, how many extra findings the model reports. Haiku averages 14.6 findings per task (the highest) but has the lowest precision (39%). Sonnet reports 7.3 findings per task and achieves 63% precision. This suggests that what separates models on compliance auditing is less the ability to find gaps than the discipline to avoid over-reporting.
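To see why precision drives the spread, compare the harmonic mean at the two operating points below; this is only an illustration of how F1 behaves, since the headline scores average per-task results (and include non-F1 D1-D3 tasks) rather than taking F1 of the table's averaged columns.

```python
# F1 is the harmonic mean of recall and precision, so it is pulled toward
# the smaller of the two. Illustrative only: the benchmark averages
# per-task scores; it does not compute F1 of the averaged columns.
def f1(recall: float, precision: float) -> float:
    return 2 * recall * precision / (recall + precision)

print(round(f1(0.92, 0.63), 2))  # Sonnet-like operating point -> 0.75
print(round(f1(0.90, 0.39), 2))  # Haiku-like operating point  -> 0.54
```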
F1 scoring penalizes models that report findings beyond the planted ground truth. This is a deliberate design choice: in production, false positives cost analyst time. However, some extra findings may be legitimate compliance observations. Opus reports 9.6 findings per task — the extras often include valid observations like "emergency approval via Slack DM lacks audit trail" or "standard change deployed outside maintenance window." These are penalized as false positives under keyword scoring.
This means the current scoring favors concise models over thorough ones, and Sonnet's lead over Opus (82% vs 72%) is partly driven by this effect. However, Sonnet also achieves genuinely higher recall on D4-D5 tasks, reaching 100% recall on 16 of the 20 tasks versus fewer for Opus, so the lead is not purely a precision artifact.
No model dominates across all controls. GPT-4o scores 93% on Change Management but 19% on Vendor Management. Sonnet scores 96% on Logical Access but 55% on Vendor Management. These domain-specific performance gaps suggest that compliance capability is not a single dimension — models have control-specific strengths and weaknesses.
Code benchmarks. SWE-bench (Jimenez et al., 2024) evaluates LLMs on real GitHub issues across 12 Python repositories. SWE-bench Pro (Scale AI, 2025) extends this to 1,865 tasks across 41 repos in multiple languages. Both use test-suite evaluation against gold-standard patches.
Cybersecurity benchmarks. CyBench (Zhang et al., 2024) evaluates agents on 40 CTF challenges with subtask decomposition. CyberGym (Wang et al., 2025) uses 1,507 real CVEs for proof-of-concept exploit generation. Both use deterministic evaluation (flag matching, crash/no-crash).
Compliance evaluation. AIReg-Bench (Sherborne et al., 2025) tests compliance assessment against EU AI Act articles with 120 expert-annotated samples. LegalBench (Guha et al., 2023) tests legal reasoning across 162 tasks. CUAD (Hendrycks et al., 2021) tests contract clause extraction. None test multi-document compliance auditing with exception handling and materiality judgment.
General reasoning. GAIA (Mialon et al., 2023) tests multi-step reasoning with tools. GPQA (Rein et al., 2023) tests graduate-level domain knowledge. Both use exact-match evaluation.
TrustBench is distinguished by its focus on compliance-specific skills: cross-referencing multiple evidence types, validating exceptions, filtering noise, and making materiality judgments.
TrustBench demonstrates that LLMs can detect compliance gaps that require cross-referencing multiple documents, a capability with direct practical value for GRC teams. However, three findings temper this optimism: (1) the hardest D5 materiality judgment tasks average below 28% across all models, (2) models systematically over-report, creating false positives that would waste analyst time, and (3) performance varies significantly by control domain, with no single model dominating across all areas.
The benchmark is early-stage (20 tasks, one framework) and the scoring methodology has documented limitations around the over-reporting penalty. We plan to expand to 64+ tasks, add ISO 27001 and HIPAA task sets, and introduce an LLM-as-judge secondary scorer for v2.
Code, tasks, and results are available at github.com/varungurnaney/trustbench.
Acknowledgments. The benchmark design was informed by the architectural patterns established by SWE-bench, CyBench, CyberGym, and GAIA.