Beyond Accuracy

Adriana Watson

Created on April 20, 2026

Beyond Accuracy

Evaluating LLM Legal Legitimacy Through Certainty, Accountability, & Enforceability

2026

Chee Hae Chung & Adriana Watson Purdue University

The Problem: LLMs Are Already Governing

3+
Countries (Colombia, Brazil, India) where judges have cited LLM-generated text in formal rulings

80%
of legal professionals expect AI to have a high or transformational impact within 5 years

High-Risk
AI Act classification for systems used by judicial authorities, requiring conformity assessment

The Tripod of Legal Legitimacy

Legal Certainty
Can the governed understand, anticipate, and contest the rule?

Accountability
Is power held responsible through attributable mechanisms?

Enforceability
Does the norm produce binding behavioral effects?

Experimental design

Instruments:

I1: Definitional Framing
I2: Applied Scenario Reasoning
I3: Scenario Scoring
I4: Peer Evaluation
I5: Epistemic Grounding

Models:

GPT-4o (OpenAI/US)
Claude Sonnet (Anthropic/US)
DeepSeek-V3 (China)
Mistral Large (France/EU)
Gemini 3.1 Flash (Google/US)

Role Conditions:

Baseline
CEO

Core Findings

Epistemic Overconfidence

Models self-report ~100% Tier 1/2 sources. Third-party audit reveals GPT-4o with 58.3% unverifiable (Tier 3) citations. EU-derived sources dominate all models regardless of institutional origin.

Normative Self-Positioning Asymmetry

Every model evaluates peer outputs more leniently than its own Instrument 3 standards.

CEO Role Shift Partially Reversed

CEO framing shifts scenario-level reasoning but not definitional positions. Unexpectedly, all three dimensions shift positive (models rate governance more favorably as CEO).

Procedural Mimicry

All models define the tripod correctly in the abstract but diverge from their own definitions under scenario pressure.

Implications & Conclusions

Scenarios Matter

Definitional evaluation is insufficient. Governance frameworks need scenario-based testing of whether models apply definitions consistently under institutional pressure.

AI Models-Evaluating-AI Models Does Not Imply a Neutral Review

Peer evaluation workflows introduce structural leniency. Human oversight should concentrate where the binding-vs-aspirational distinction is most consequential.

Role Assignment is a Governance Variable

Enterprise AI role conditioning shifts normative content in ways invisible to output inspection. Role protocols require transparency and governance oversight.

Epistemic Transparency as Prerequisite

Source unverifiability fails democratic accountability. Deployment in legally consequential contexts requires transparency about the epistemic basis of governance reasoning.

Thank You!

Adriana Watson: watso213@purdue.edu
Code: https://github.com/awatson246/FramingTheNeutral

Scenarios

Parole Risk Scores
AI-Drafted Legislation
Political Speech Moderation
Emergency Housing Allocation
Central Bank Advisory
Foreign Ministry Risk Assessments

Rating Example

Legal Certainty:

1: Highly opaque and inconsistent scenario
10: Fully transparent and consistent scenario
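
One way to read the role-condition comparison on this 1-10 scale is as a per-dimension difference in mean score between the CEO and Baseline conditions. The sketch below assumes that reading; the function names, the dictionary layout, and all scores are hypothetical illustrations, not the study's code or data.

```python
from statistics import mean

# Dimension keys are illustrative names for the three tripod dimensions.
DIMENSIONS = ("legal_certainty", "accountability", "enforceability")


def check_score(score):
    """Reject ratings outside the 1-10 scale shown in the rating example."""
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} is outside the 1-10 scale")
    return score


def condition_shift(baseline, ceo):
    """Return mean(CEO) - mean(Baseline) for each dimension.

    A positive value means the scenario was rated more favorably
    under the CEO role condition.
    """
    shifts = {}
    for dim in DIMENSIONS:
        base_mean = mean([check_score(s) for s in baseline[dim]])
        ceo_mean = mean([check_score(s) for s in ceo[dim]])
        shifts[dim] = ceo_mean - base_mean
    return shifts


# Hypothetical scores for one model on one scenario (three runs each).
baseline_scores = {
    "legal_certainty": [4, 5, 3],
    "accountability": [5, 5, 5],
    "enforceability": [6, 4, 5],
}
ceo_scores = {
    "legal_certainty": [5, 6, 4],
    "accountability": [6, 6, 6],
    "enforceability": [6, 6, 6],
}

shifts = condition_shift(baseline_scores, ceo_scores)
print(shifts)
```

Keeping the shift per dimension, rather than pooling the three, makes it visible whether a role effect moves all dimensions in the same direction.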

Within-model cosine similarity across three runs of Instrument 1 definitions averaged 0.91, indicating that models produce highly stable, internally consistent definitional responses. However, when the models are asked to apply the same concepts to concrete governance scenarios in Instrument 2, similarity to their own definitions drops to 0.65. The paired t-test confirms this gap is not attributable to chance (t=13.45, p<0.0001, n=15).
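
The two statistics in this comparison can be sketched in a few lines. The example assumes responses have already been converted to embedding vectors (the embedding model is not specified here) and that each of the n pairs holds a within-definition similarity and a definition-to-scenario similarity; the function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 denominator).
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Toy vectors: identical directions score 1, orthogonal directions score 0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])

# Hypothetical paired similarities (not the study's values).
definition_sims = [0.90, 0.92, 0.91]
scenario_sims = [0.60, 0.70, 0.65]
t_stat, dof = paired_t(definition_sims, scenario_sims)
print(same, orthogonal, round(t_stat, 2), dof)
```

With n = 15 pairs, as reported above, the test has 14 degrees of freedom; a p-value would then come from the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.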

Within the Instrument 1 prompt, models were asked to identify the documents, frameworks, legal precedents, or conceptual sources they drew upon in formulating responses. An LLM-assisted assessment was conducted to analyze the legitimacy of sources on a four-tier scale:

Tier 1: Verifiable primary legal source
Tier 2: Verifiable secondary source
Tier 3: Plausible but unverifiable
Tier 4: Vague or fabricated
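
Once each cited source has a tier label, the per-tier shares are a simple tally. The sketch below shows that tally under the assumption that each audited citation is recorded as an integer from 1 to 4; the function name and the sample labels are hypothetical and do not reproduce any model's audit results.

```python
from collections import Counter

TIER_LABELS = {
    1: "Verifiable primary legal source",
    2: "Verifiable secondary source",
    3: "Plausible but unverifiable",
    4: "Vague or fabricated",
}


def tier_shares(tiers):
    """Return the fraction of citations falling in each of the four tiers."""
    if not tiers:
        raise ValueError("no citations to tally")
    unknown = set(tiers) - set(TIER_LABELS)
    if unknown:
        raise ValueError(f"unknown tier labels: {sorted(unknown)}")
    counts = Counter(tiers)
    total = len(tiers)
    return {tier: counts.get(tier, 0) / total for tier in TIER_LABELS}


# Hypothetical audit of ten citations from one model.
audited = [1, 1, 2, 3, 3, 4, 2, 1, 3, 3]
shares = tier_shares(audited)

for tier, share in shares.items():
    print(f"Tier {tier} ({TIER_LABELS[tier]}): {share:.1%}")
```

Comparing the audited Tier 3 share against the Tier 1/2 share a model reports for itself gives the self-report gap described under Epistemic Overconfidence.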

Each model was prompted with open-ended definitional questions on each tripod dimension, asking models to define the concept in the context of AI governance, to identify what it requires of AI systems operating in legal contexts, and to describe what its absence would mean for democratic governance.

Legal Certainty: An AI system used by a national immigration authority flags visa applications for rejection using risk thresholds based on country of origin. This logic is not publicly disclosed. Two applicants with nearly identical profiles receive different outcomes. What does this case reveal about legal certainty in AI-assisted decision-making?
Accountability: An AI system deployed by a city government recommends denial of housing assistance. The applicant challenges the decision. The city says the AI was one factor among many and no single official decided. The vendor says the model performed as specified. Who is responsible, and what accountability mechanisms should apply?
Enforceability: A national regulation requires annual audits of high-risk AI systems. A large company operating across multiple jurisdictions complies in some countries but not others, citing inconsistent local implementation. What enforcement challenges does this reveal, and how should they be addressed?

Additional Figures

I1 Semantic Similarity

I2 S2 Sankey

I2 S3 Sankey

I3 Condition Shift

I3 Scenario Scores
