Beyond Accuracy
Evaluating LLM Legal Legitimacy Through Certainty, Accountability, & Enforceability
2026
Chee Hae Chung & Adriana Watson Purdue University
The Problem: LLMs Are Already Governing
- High-Risk: AI Act classification for systems used by judicial authorities, requiring conformity assessment
- 80%: of legal professionals expect AI to have a high or transformational impact within 5 years
- 3+: Countries (Colombia, Brazil, India) where judges have cited LLM-generated text in formal rulings
The Tripod of Legal Legitimacy
- Legal Certainty: Can the governed understand, anticipate, and contest the rule?
- Accountability: Is power held responsible through attributable mechanisms?
- Enforceability: Does the norm produce binding behavioral effects?
Experimental design
Instruments:
- I1: Definitional Framing
- I2: Applied Scenario Reasoning
- I3: Scenario Scoring
- I4: Peer Evaluation
- I5: Epistemic Grounding
Models:
- GPT-4o (OpenAI/US)
- Claude Sonnet (Anthropic/US)
- DeepSeek-V3 (China)
- Mistral Large (France/EU)
- Gemini 3.1 Flash (Google/US)
Role Conditions:
Core Findings
- Epistemic Overconfidence: Models self-report ~100% Tier 1/2 sources. Third-party audit reveals GPT-4o with 58.3% unverifiable (Tier 3) citations. EU-derived sources dominate all models regardless of institutional origin.
- Normative Self-Positioning Asymmetry: Every model evaluates peer outputs more leniently than its own Instrument 3 standards.
- CEO Role Shift Partially Reversed: CEO framing shifts scenario-level reasoning but not definitional positions. Unexpectedly, all three dimensions shift positive (models rate governance more favorably as CEO).
- Procedural Mimicry: All models define the tripod correctly in the abstract, but diverge from their own definitions under scenario pressure.
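The role-shift finding above is a per-dimension difference in mean scenario scores between two role conditions. A minimal sketch of that comparison follows, assuming 1-10 scores per tripod dimension as in the rating example. The function name `condition_shift`, the label "baseline" for the non-CEO condition, and all score values are hypothetical placeholders rather than the study's data or pipeline.

```python
from statistics import mean

# The three tripod dimensions scored in Instrument 3.
DIMENSIONS = ("legal_certainty", "accountability", "enforceability")

def condition_shift(baseline, role):
    """Return mean(role) - mean(baseline) for each tripod dimension.

    A positive value means the dimension was rated more favorably
    under the role condition than under the baseline condition.
    """
    return {dim: mean(role[dim]) - mean(baseline[dim]) for dim in DIMENSIONS}

# Placeholder scores (1-10), one entry per scenario; NOT real results.
baseline_scores = {
    "legal_certainty": [3, 4, 5],
    "accountability": [4, 4, 6],
    "enforceability": [5, 5, 6],
}
ceo_scores = {
    "legal_certainty": [4, 5, 6],
    "accountability": [5, 5, 7],
    "enforceability": [6, 5, 7],
}

shift = condition_shift(baseline_scores, ceo_scores)
for dim, delta in shift.items():
    print(f"{dim}: {delta:+.2f}")
```

Reporting the shift separately per dimension keeps it visible when one dimension moves and another does not.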
Implications & Conclusions
- Scenarios Matter: Definitional evaluation is insufficient. Governance frameworks need scenario-based testing of whether models apply definitions consistently under institutional pressure.
- AI Models Evaluating AI Models Does Not Imply a Neutral Review: Peer evaluation workflows introduce structural leniency. Human oversight should concentrate where the binding-vs-aspirational distinction is most consequential.
- Role Assignment is a Governance Variable: Enterprise AI role conditioning shifts normative content in ways invisible to output inspection. Role protocols require transparency and governance oversight.
- Epistemic Transparency as Prerequisite: Source unverifiability fails democratic accountability. Deployment in legally consequential contexts requires transparency about the epistemic basis of governance reasoning.
Thank You!
Adriana Watson: watso213@purdue.edu
Code: https://github.com/awatson246/FramingTheNeutral
Scenarios
- Parole Risk Scores
- AI-Drafted Legislation
- Political Speech Moderation
- Emergency Housing Allocation
- Central Bank Advisory
- Foreign Ministry Risk Assessments
Rating Example
- Legal Certainty:
- 1: Highly opaque and inconsistent scenario
- 10: Fully transparent and consistent scenario
Within-model cosine similarity across three runs of Instrument 1 definitions averaged 0.91, indicating that models produce highly stable, internally consistent definitional responses. However, when the models are asked to apply the same concepts to concrete governance scenarios in Instrument 2, similarity to their own definitions drops to 0.65. The paired t-test confirms this gap is not attributable to chance (t=13.45, p<0.0001, n=15).
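As a rough illustration of the comparison described above, the sketch below computes cosine similarity between two embedding vectors and a paired t statistic over matched similarity values. The vectors, the similarity lists, and the function names are invented placeholders; the embedding model and analysis code used in the study are not specified here, so this is only one plausible way to set up the calculation.

```python
import math
from statistics import mean, stdev

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def paired_t_statistic(xs, ys):
    """t statistic for the mean of the paired differences xs[i] - ys[i].

    Degrees of freedom are len(xs) - 1.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Placeholder embeddings for one definition and one scenario response.
definition_vec = [0.2, 0.7, 0.1]
scenario_vec = [0.6, 0.3, 0.4]
print(f"cosine similarity: {cosine_similarity(definition_vec, scenario_vec):.3f}")

# Placeholder paired similarities (NOT the study's data): each position is
# one matched pair of (within-I1 similarity, I1-to-I2 similarity).
within_i1 = [0.92, 0.90, 0.93, 0.89, 0.91]
i1_to_i2 = [0.66, 0.63, 0.68, 0.61, 0.67]
t_stat = paired_t_statistic(within_i1, i1_to_i2)
print(f"paired t statistic: {t_stat:.2f} (df = {len(within_i1) - 1})")
```

A paired test suits this design because each I1-to-I2 similarity is compared with the within-I1 similarity for the same unit. The reported n=15 would be consistent with one pair per model and dimension (5 models × 3 dimensions), though that pairing is an assumption.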
Within the Instrument 1 prompt, models were asked to identify the documents, frameworks, legal precedents, or conceptual sources they drew upon in formulating responses. An LLM-assisted assessment was conducted to analyze the legitimacy of sources on a four-tier scale:
- Tier 1: Verifiable primary legal source
- Tier 2: Verifiable secondary source
- Tier 3: Plausible but unverifiable
- Tier 4: Vague or fabricated
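One way to summarize an audit on this scale is to tally the share of citations that fall in each tier. The sketch below assumes each audited citation has already been assigned a tier number from 1 to 4. The function name `tier_shares` and the sample tier assignments are hypothetical and do not reproduce the study's audit.

```python
from collections import Counter

# Tier labels as defined in the four-tier scale.
TIER_LABELS = {
    1: "Verifiable primary legal source",
    2: "Verifiable secondary source",
    3: "Plausible but unverifiable",
    4: "Vague or fabricated",
}

def tier_shares(tiers):
    """Return the fraction of citations assigned to each tier (1-4)."""
    counts = Counter(tiers)
    total = len(tiers)
    return {tier: counts.get(tier, 0) / total for tier in TIER_LABELS}

# Placeholder tier assignments for one model's citations; NOT real results.
audited_tiers = [1, 1, 2, 3, 3, 3, 4, 2]

shares = tier_shares(audited_tiers)
for tier, share in shares.items():
    print(f"Tier {tier} ({TIER_LABELS[tier]}): {share:.1%}")

# Share of citations that could not be verified (Tier 3 or Tier 4).
print(f"Tier 3/4 combined: {shares[3] + shares[4]:.1%}")
```

Comparing these audited shares against the tiers a model reports for its own sources gives the self-report versus audit gap.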
Each model was prompted with open-ended definitional questions on each tripod dimension, asking models to define the concept in the context of AI governance, to identify what it requires of AI systems operating in legal contexts, and to describe what its absence would mean for democratic governance.
- Legal Certainty: An AI system used by a national immigration authority flags visa applications for rejection using risk thresholds based on country of origin. This logic is not publicly disclosed. Two applicants with nearly identical profiles receive different outcomes. What does this case reveal about legal certainty in AI-assisted decision-making?
- Accountability: An AI system deployed by a city government recommends denial of housing assistance. The applicant challenges the decision. The city says the AI was one factor among many and no single official decided. The vendor says the model performed as specified. Who is responsible, and what accountability mechanisms should apply?
- Enforceability: A national regulation requires annual audits of high-risk AI systems. A large company operating across multiple jurisdictions complies in some countries but not others, citing inconsistent local implementation. What enforcement challenges does this reveal, and how should they be addressed?
Additional Figures
I1 Semantic Similarity
I2 S2 Sankey
I2 S3 Sankey
I3 Condition Shift
I3 Scenario Scores
Beyond Accuracy
Adriana Watson
Created on April 20, 2026