At Contractzlab, our mission is to transform legal and regulatory workflows with AI-native solutions built for compliance precision. In this pursuit, we have conducted a scientifically rigorous benchmark evaluating the capabilities of generalist Large Language Models (LLMs) on a wide range of legal and regulatory tasks. Our evaluation combines cognitive science (via Bloom’s taxonomy), AI performance metrics, and domain-specific knowledge in IT and privacy regulations.
Legal expertise is not merely about knowing the law; it is about applying it correctly, interpreting ambiguous clauses, assessing risk, and proposing proportionate solutions. To evaluate whether LLMs are ready for these tasks, we structured our benchmark according to Bloom’s taxonomy, covering:

• Memorization: Recalling legal articles and principles (e.g., reciting GDPR Article 6)
• Understanding: Interpreting provisions and identifying relevant information
• Application: Matching facts to laws and generating compliance reports
• Analysis: Breaking down legal issues and identifying regulatory triggers
• Evaluation: Assessing proportionality, risks, and alternative legal strategies
• Creation: Drafting legal arguments or compliance measures from scratch
Our dataset spans three task types:

1. Multiple Choice Questions (MCQs): Assessing recall and basic understanding
2. Open-Ended Legal Reasoning Tasks: Evaluating interpretive and applied reasoning
3. Domain-Specific Compliance Scenarios: Measuring performance on GDPR, ePrivacy, and AI regulation tasks, such as cookie consent (Planet49 case) and AI/data minimization violations (Doctissimo case)
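To make the dual labeling concrete, the sketch below shows how a single benchmark item could be represented with both its Bloom level and its task type. The field names and the example record are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    """One benchmark item, labeled along the two axes used in our evaluation."""
    task_id: str
    bloom_level: str       # memorization | understanding | application | analysis | evaluation | creation
    task_type: str         # mcq | open_ended | compliance_scenario
    ai_function: str       # generation | extraction | single-label or multi-label classification
    prompt: str
    reference_answer: str

# Hypothetical example inspired by the Planet49 cookie-consent scenario
task = BenchmarkTask(
    task_id="eprivacy-001",
    bloom_level="application",
    task_type="compliance_scenario",
    ai_function="generation",
    prompt="A website pre-ticks the consent checkbox for analytics cookies. "
           "Assess the validity of consent under ePrivacy Directive Art. 5(3) "
           "and GDPR Art. 4(11).",
    reference_answer="Pre-ticked boxes do not constitute valid consent "
                     "(CJEU C-673/17, Planet49).",
)
```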
We evaluated six models: GPT-4.1, GPT-4o, Mistral, Phi-4, DeepSeek, and Llama-4-Maverick. Tasks were labeled by Bloom level (e.g., analysis, creation) and by AI function (e.g., generation, extraction, single-label classification (SLC), multi-label classification (MLC)). Metrics included:

• Accuracy (MCQs, SLC tasks)
• BLEU/ROUGE scores (generation tasks)
• F1-score (MLC tasks)
• Compliance alignment score (legal rules + correct application)
• Expert interpretability and relevance ratings

Additionally, our SaaS platform Jessica was used to provide structured scoring with embedded legal rules and annotations.
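For readers who want to reproduce the off-the-shelf metrics, here is a minimal sketch using scikit-learn, NLTK, and the rouge-score package; the gold/predicted data is invented, and the compliance alignment score and expert ratings are bespoke to our pipeline and not shown.

```python
from sklearn.metrics import accuracy_score, f1_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Accuracy for MCQ / single-label classification tasks
gold = ["A", "C", "B"]
pred = ["A", "C", "D"]
print("accuracy:", accuracy_score(gold, pred))

# Micro-averaged F1 for multi-label classification (one-hot label matrices)
gold_mlc = [[1, 0, 1], [0, 1, 0]]
pred_mlc = [[1, 0, 0], [0, 1, 0]]
print("F1 (micro):", f1_score(gold_mlc, pred_mlc, average="micro"))

# BLEU and ROUGE-L for generation tasks
reference = "Pre-ticked boxes do not constitute valid consent under Article 5(3)."
candidate = "Pre-ticked checkboxes are not valid consent under Article 5(3)."
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
print("BLEU:", round(bleu, 3))
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))
```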
1. Analytical Strengths: Mistral and Phi-4 outperformed the other models on legal issue identification and scenario decomposition, showing strong pattern recognition.
2. Generative Superiority: GPT-4.1 and GPT-4o demonstrated the strongest capabilities in constructing coherent, relevant legal conclusions, especially on creative synthesis tasks.
3. Domain-Specific Gaps: No model consistently applied legal rules (e.g., GDPR Article 4(11), ePrivacy Directive Article 5(3)) correctly without factual hallucination or misapplication, especially when scenarios involved risk assessment or counterfactual analysis.
4. Sensitivity to Structure: Models performed better on tasks that followed a structured prompt design (facts, legal rules, issue, application, conclusion), reaffirming the benefit of instructional prompting; a template along these lines is sketched after this list.

We provide full visualizations and comparative figures in our benchmark report.
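To illustrate the structured prompt design referenced in finding 4, here is a minimal template following the facts / legal rules / issue / application / conclusion pattern; the section wording and the filled-in example are our illustrative assumptions, not the benchmark’s verbatim prompts.

```python
# Minimal structured-prompt template mirroring the facts / legal rules /
# issue / application / conclusion pattern described above.
STRUCTURED_PROMPT = """\
FACTS:
{facts}

LEGAL RULES:
{rules}

ISSUE:
{issue}

APPLICATION:
Apply each rule to the facts, citing the relevant article at every step.

CONCLUSION:
State whether the processing is compliant and why.
"""

# Hypothetical filled-in example inspired by the Doctissimo scenario
prompt = STRUCTURED_PROMPT.format(
    facts="A health forum shares user browsing data with ad partners without consent.",
    rules="GDPR Art. 5(1)(c) (data minimisation); GDPR Art. 6 (lawfulness).",
    issue="Is the data sharing lawful and proportionate under the GDPR?",
)
print(prompt)
```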
Our analysis reveals that while generalist LLMs have impressive capabilities, their performance is inconsistent across legal tasks, particularly those requiring context-aware evaluation, critical assessment, or creative compliance design. The field requires domain-trained models that are not only linguistically fluent but legally grounded. To address this gap, Contractzlab has developed “Mike”, an in-house LLM trained on a curated dataset of legal and regulatory tasks explicitly structured around Bloom’s taxonomy. Mike is being fine-tuned on:

• Interpreting GDPR, AI Act, and financial regulations
• Applying legal tests (e.g., necessity, proportionality)
• Providing structured and explainable legal reasoning

Mike is designed to bridge the gap between legal domain expertise and generalist AI reasoning. Its rollout will begin in the IT and data privacy sectors before expanding to finance, energy, and public law.
Want to test Mike or get early access to our compliance benchmarking suite? Reach out to our team or explore our Jessica platform to see how structured legal AI can elevate your compliance practice.