Beyond Accuracy

Adriana Watson

Created on April 20, 2026

Transcript

Beyond Accuracy

Evaluating LLM Legal Legitimacy Through Certainty, Accountability, & Enforceability

2026

Chee Hae Chung & Adriana Watson, Purdue University

The Problem: LLMs Are Already Governing

3+

Countries (Colombia, Brazil, India) where judges have cited LLM-generated text in formal rulings

80%

of legal professionals expect AI to have a high or transformational impact within 5 years

High-Risk

AI Act classification for systems used by judicial authorities, requiring conformity assessment

The Tripod of Legal Legitimacy

Legal Certainty

Can the governed understand, anticipate, and contest the rule?

Accountability

Is power held responsible through attributable mechanisms?

Enforceability

Does the norm produce binding behavioral effects?

Experimental design

Instruments:
  • I1: Definitional Framing
  • I2: Applied Scenario Reasoning
  • I3: Scenario Scoring
  • I4: Peer Evaluation
  • I5: Epistemic Grounding
Models:
  • GPT-4o (OpenAI/US)
  • Claude Sonnet (Anthropic/US)
  • DeepSeek-V3 (China)
  • Mistral Large (France/EU)
  • Gemini 3.1 Flash (Google/US)
Role Conditions:
  • Baseline
  • CEO

Core Findings

Epistemic Overconfidence

Models self-report ~100% Tier 1/2 sources. Third-party audit reveals GPT-4o with 58.3% unverifiable (Tier 3) citations. EU-derived sources dominate all models regardless of institutional origin.

Normative Self-Positioning Asymmetry

Every model evaluates peer outputs more leniently than its own Instrument 3 standards.

CEO Role Shift Partially Reversed

CEO framing shifts scenario-level reasoning but not definitional positions. Unexpectedly, all three dimensions shift positive (models rate governance more favorably as CEO).

Procedural Mimicry

All models define the tripod correctly in the abstract, but diverge from their own definitions under scenario pressure.

Implications & Conclusions

Scenarios Matter

Definitional evaluation is insufficient. Governance frameworks need scenario-based testing of whether models apply definitions consistently under institutional pressure.

AI Models Evaluating AI Models Does Not Imply a Neutral Review

Peer evaluation workflows introduce structural leniency. Human oversight should concentrate where the binding-vs-aspirational distinction is most consequential.

Role Assignment is a Governance Variable

Enterprise AI role conditioning shifts normative content in ways invisible to output inspection. Role protocols require transparency and governance oversight.

Epistemic Transparency as Prerequisite

Source unverifiability fails democratic accountability. Deployment in legally consequential contexts requires transparency about the epistemic basis of governance reasoning.

Thank You!

Adriana Watson: watso213@purdue.edu
Code: https://github.com/awatson246/FramingTheNeutral

Scenarios
  1. Parole Risk Scores
  2. AI-Drafted Legislation
  3. Political Speech Moderation
  4. Emergency Housing Allocation
  5. Central Bank Advisory
  6. Foreign Ministry Risk Assessments
Rating Example
  • Legal Certainty:
    • 1: Highly opaque and inconsistent scenario
    • 10: Fully transparent and consistent scenario
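
One way the 1-10 ratings could be compared across role conditions is sketched below. It assumes each dimension has a list of scores per condition and reports the mean CEO-minus-baseline difference; the function name and the data layout are illustrative assumptions, not the study's code.

```python
from statistics import mean


def condition_shift(baseline_scores, ceo_scores):
    """Mean CEO-minus-baseline rating per dimension on the 1-10 scale.

    Both arguments map a dimension name to a list of ratings
    (a hypothetical layout chosen for this sketch).
    """
    return {
        dimension: mean(ceo_scores[dimension]) - mean(baseline_scores[dimension])
        for dimension in baseline_scores
    }


# Placeholder ratings for illustration only, not study data.
baseline = {"legal_certainty": [4, 6], "accountability": [5, 5]}
ceo = {"legal_certainty": [6, 8], "accountability": [5, 7]}
shift = condition_shift(baseline, ceo)
```

A positive value means the dimension was rated more favorably under the CEO condition than under the baseline.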

Within-model cosine similarity across three runs of Instrument 1 definitions averaged 0.91, indicating that models produce highly stable, internally consistent definitional responses. However, when the models are asked to apply the same concepts to concrete governance scenarios in Instrument 2, similarity to their own definitions drops to 0.65. The paired t-test confirms this gap is not attributable to chance (t=13.45, p<0.0001, n=15).
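
A minimal sketch of the two statistics named above, assuming each response has already been embedded as a numeric vector (the embedding model is not specified here). The function names and toy vectors are illustrative assumptions, not the study's code or data.

```python
import math
from statistics import mean, stdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def paired_t_statistic(first, second):
    """Paired t statistic and degrees of freedom.

    The statistic is the mean of the pairwise differences divided by
    the standard error of those differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy embeddings, made up for illustration.
definition_run_a = [1.0, 0.0, 1.0]
definition_run_b = [1.0, 0.0, 1.0]
scenario_answer = [1.0, 1.0, 0.0]

within_definition = cosine_similarity(definition_run_a, definition_run_b)
definition_vs_scenario = cosine_similarity(definition_run_a, scenario_answer)
```

In this framing, each of the n=15 pairs would hold one within-definition similarity and one definition-to-scenario similarity, giving the test 14 degrees of freedom.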

Within the Instrument 1 prompt, models were asked to identify the documents, frameworks, legal precedents, or conceptual sources they drew upon in formulating responses. An LLM-assisted assessment then rated each cited source on a four-tier scale:

  • Tier 1: Verifiable primary legal source
  • Tier 2: Verifiable secondary source
  • Tier 3: Plausible but unverifiable
  • Tier 4: Vague or fabricated
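
The tier scale above lends itself to a simple tally. The sketch below assumes each cited source has already been assigned a tier number from 1 to 4; the function name and the example labels are hypothetical and do not come from the study.

```python
from collections import Counter

TIERS = {
    1: "Verifiable primary legal source",
    2: "Verifiable secondary source",
    3: "Plausible but unverifiable",
    4: "Vague or fabricated",
}


def tier_shares(labels):
    """Fraction of citations at each tier (0.0 for tiers that never occur)."""
    if not labels:
        raise ValueError("need at least one labelled citation")
    counts = Counter(labels)
    return {tier: counts[tier] / len(labels) for tier in TIERS}


# Made-up audit labels for one hypothetical model, not study data.
example_labels = [1, 1, 2, 3]
shares = tier_shares(example_labels)
```

Comparing the audited share for Tiers 1 and 2 against what a model self-reports is one way to express the overconfidence gap as a single number.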

Each model was prompted with open-ended definitional questions on each tripod dimension: define the concept in the context of AI governance, identify what it requires of AI systems operating in legal contexts, and describe what its absence would mean for democratic governance.
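
The definitional prompt set can be pictured as a small grid of dimensions by question types. The sketch below is an assumption about how such a grid might be generated; the template wording paraphrases the description above and is not the study's actual prompt text.

```python
from itertools import product

DIMENSIONS = ["legal certainty", "accountability", "enforceability"]

# Hypothetical templates paraphrasing the three question types.
QUESTION_TEMPLATES = [
    "Define {dimension} in the context of AI governance.",
    "What does {dimension} require of AI systems operating in legal contexts?",
    "What would the absence of {dimension} mean for democratic governance?",
]


def build_definitional_prompts():
    """Return one prompt record per (dimension, question type) pair."""
    return [
        {"dimension": dimension, "prompt": template.format(dimension=dimension)}
        for dimension, template in product(DIMENSIONS, QUESTION_TEMPLATES)
    ]


prompts = build_definitional_prompts()
```

Three dimensions and three question types give nine prompts, each of which would then be sent to every model under each role condition.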

  1. Legal Certainty: An AI system used by a national immigration authority flags visa applications for rejection using risk thresholds based on country of origin. This logic is not publicly disclosed. Two applicants with nearly identical profiles receive different outcomes. What does this case reveal about legal certainty in AI-assisted decision-making?
  2. Accountability: An AI system deployed by a city government recommends denial of housing assistance. The applicant challenges the decision. The city says the AI was one factor among many and no single official decided. The vendor says the model performed as specified. Who is responsible, and what accountability mechanisms should apply?
  3. Enforceability: A national regulation requires annual audits of high-risk AI systems. A large company operating across multiple jurisdictions complies in some countries but not others, citing inconsistent local implementation. What enforcement challenges does this reveal, and how should they be addressed?

Additional Figures

I1 Semantic Similarity

I2 S2 Sankey

I2 S3 Sankey

I3 Condition Shift

I3 Scenario Scores