
Transcript

Team APEX

Fine-tuning LLMs to auto-generate training datasets with question-answer pairs from legal documents

AGENDA

Project Network & Overview
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A

AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE

01

Project Network & Overview

Accure specializes in AI and ML platforms, providing data engineering and professional services. With more than 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.

ACCURE CUSTOMERS AND PARTNERS:
Meet our Partner

GPT

securegpt

Visualization

Data Warehousing

Deployment

insent

Impulse

Momentum

Accure offers an array of products and solutions with a proven track record across various industries

Developer / Sprint 5 Scrum Master

Developer / Sprint 1 Scrum Master

Developer / Sprint 2 Scrum Master

Alex Marcia-Gonzalez

Developer / Sprint 3 Scrum Master

Henry Wu

Mitch Breeden

Product Owner

Sushmitha Tamilselvan

Jacob Baisden

Team Overview

While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets for the fine-tuning process, LLMs remain generic.

FORBES

"If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs"

LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.

expert knowledge needed

High costs due to expert involvement and extensive labor hours.

financially demanding

Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.

TIME CONSUMING effort

These current fine-tuning practices remain far from ideal.

Traditional methods to create datasets that fine-tune LLMs involve manual data labeling

Minimal human oversight streamlines and simplifies LLM dataset development and eases time and financial burdens.

minimal oversight

Self-generated datasets facilitate automated, independent fine-tuning of LLMs.

self generated

High-quality question-answer pairs ensure efficient LLM output.

quality data Pairs

Team APEX ventures to address this challenge

There is a need for a streamlined approach to improve on the status quo.

AN OVERVIEW OF PROCURED DATA AND ITS USE CASE

02

DATASETS

THE CASELAW ACCESS PROJECT

6.9M unique cases

The caselaw project dates as far back as the early 1600s. However, for our purposes, we focus on more recent cases (2013-2023).

Expansive

Harvard Law School compiled and has ownership of the project. Despite this, the data remains freely available.

Harvard Owned

Open-source status indicates the license is free for public use and the dataset has no restrictions for researchers.

Open Source

Team APEX collected the data through an API (for large numbers of cases) and accessed individual cases through the project website.

data assessment

Reliable information

Only minor data conditioning was needed, thanks to the quality of the provider.
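The API collection described above can be sketched in a few lines. This is a minimal sketch, assuming the public Caselaw Access Project REST endpoint (api.case.law) and an API token; the date filter, page size, and result cap are illustrative, not the team's exact script.

import requests

# Minimal sketch: pull full-text cases decided between 2013 and 2023
# from the Caselaw Access Project API (illustrative parameters only).
API_URL = "https://api.case.law/v1/cases/"
API_TOKEN = "YOUR_CAP_TOKEN"  # hypothetical placeholder

params = {
    "decision_date_min": "2013-01-01",
    "decision_date_max": "2023-12-31",
    "full_case": "true",   # include the full case body, not just metadata
    "page_size": 100,
}
headers = {"Authorization": f"Token {API_TOKEN}"}

cases = []
url = API_URL
while url and len(cases) < 1000:  # cap the pull for this sketch
    resp = requests.get(url, params=params if url == API_URL else None,
                        headers=headers, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    cases.extend(payload["results"])
    url = payload.get("next")     # the API paginates with a "next" URL

print(f"Retrieved {len(cases)} cases")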

data labeling process

01 · Highlight the information you wish to create the questions for
02 · Feed the corpus (court case) into a GPT along with general instructions
03 · Receive the question-answer / answer-question pair
04 · Review, correct, and repeat as needed
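Step 02 above can be illustrated with a chat-style GPT call. This is a minimal sketch assuming the OpenAI Python client; the model name and instruction wording are hypothetical stand-ins for the team's actual prompt.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical general instructions; the team's actual prompt may differ.
instructions = (
    "You are given an excerpt from a court case. "
    "Write one question that the highlighted passage answers, "
    "then give the answer, formatted as 'Question: ...' and 'Answer: ...'."
)

highlighted_passage = (
    "Accordingly, Amazon's motion for summary judgment as to the fifth "
    "cause of action for false advertising is GRANTED."
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": highlighted_passage},
    ],
)
print(response.choices[0].message.content)  # the pair to review and correct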

FINE-TUNING DATASET

100 manually labeled question-answer pairs

Result: 80% Train · 18% Test · 2% Eval
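A minimal sketch of the 80/18/2 split, assuming the 100 labeled pairs are stored as a JSON list of context/question/answer records; the file name and random seed are illustrative.

import json
import random

# Load the manually labeled pairs (hypothetical file name).
with open("labeled_pairs.json") as f:
    pairs = json.load(f)  # list of {"context": ..., "question": ..., "answer": ...}

random.seed(42)           # illustrative seed for a reproducible shuffle
random.shuffle(pairs)

n = len(pairs)            # 100 in our case
n_train = round(0.80 * n)             # 80 pairs
n_test = round(0.18 * n)              # 18 pairs
train = pairs[:n_train]
test = pairs[n_train:n_train + n_test]
evaluation = pairs[n_train + n_test:]  # remaining 2 pairs

print(len(train), len(test), len(evaluation))  # 80 18 2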

ANALYSIS OF MODEL SELECTION PROCESS

03

MODEL SELECTION

Parameter counts compared:

110M vs 175B
110M vs 7B-180B
7B-70B vs 7B-180B
7B-70B vs 40B-180B
7B-70B vs ~1B-11B

We chose FLAN-T5 Large & XL to train. To make the frameworks manageable, LoRA is applied to these models as well.

Working with orders of magnitude

Selected Model

SELECTED MODEL

T5 ARCHITECTURE & LoRA

[Diagram: T5 encoder-decoder stack (positional encoding, self attention, encoder/decoder attention, feed forward, add & normalize layers, linear + softmax head) with LoRA adapters — low-rank matrices alongside the pretrained weights — applied to the Q and V attention projections]


Resource Requirements

Given the quantization of the model, we found that resource efficiency can be achieved using a single NVIDIA A100 40GB GPU. This also leaves room for overhead.
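A minimal sketch of this setup, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the LoRA rank, alpha, and dropout values are illustrative rather than the team's exact configuration.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "google/flan-t5-xl"  # ~3B parameters

# 4-bit quantization so the base weights fit comfortably on one A100 40GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the query and value projections of the T5 attention blocks.
lora_config = LoraConfig(
    r=16,                     # illustrative rank
    lora_alpha=32,            # illustrative scaling
    lora_dropout=0.05,        # illustrative dropout
    target_modules=["q", "v"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable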

APPROACHES TO TRAINING AND VALIDATION RESULTS

04

MODEL TRAINING & VALIDATION

Standard LLM Metrics are Unreliable for Question Generation

Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE.

Hypothetical Scenario:

Context (From Dataset): The law firm handled the case pro bono to support the community.

Expected Question (From Dataset): Why did the law firm handle the case pro bono?

LLM Generated Question: Why did the law firm handle the case pro bono?

Estimated ROUGE scores:

ROUGE-1 estimate is perfect! 100%
ROUGE-2 estimate is perfect! 100%
ROUGE-L estimate is perfect! 100%

Hypothetical Scenario:

Context (From Dataset): The law firm handled the case pro bono to support the community.

Expected Question (From Dataset): Why did the law firm handle the case pro bono?

LLM Generated Question: How did the law firm support the community?

Estimated ROUGE scores:

ROUGE-1 estimate is no better than chance: 50%
ROUGE-2 estimate is poor: 25%
ROUGE-L estimate is no better than chance: 50%

A valid, context-grounded question is penalized simply because it uses different words than the reference.

Enter BERTScores

This model-agnostic solution leverages interchangeable context embeddings from BERT transformers to focus on the meaning and context of outputs, rather than on the exact words used and their order. As a result, we are able to use the following metrics:

loss · Indicator of the LLM's prediction error during training
question reliability · The model's propensity toward consistent question generation
precision · Ratio of true positives to all positive predictions
recall · Ratio of true positives to all actual positives
f1 · Harmonic mean of precision and recall
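A minimal sketch of the comparison, assuming the Hugging Face evaluate library with its rouge and bertscore modules; actual scores will differ somewhat from the estimates above.

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

reference = ["Why did the law firm handle the case pro bono?"]
candidate = ["How did the law firm support the community?"]

# Word-overlap metric: punishes the paraphrase despite its validity.
print(rouge.compute(predictions=candidate, references=reference))

# Embedding-based metric: rewards semantic closeness to the reference.
print(bertscore.compute(predictions=candidate, references=reference, lang="en"))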


MODEL BASELINES · FLAN-T5-LARGE

Validation loss: 1.65 · Question reliability: 0% · Precision: 80% · Recall: 82% · F1: 81%

Context (From Dataset): Answer: An absence of evidence to support Apple's case Context: Because Apple is the non-moving party but will bear the burden of proof at trial on the false advertising claim, Amazon can prevail merely by pointing out to the court that there is an absence of evidence to support Apple’s case. Celotex, 477 U.S. at 324-25, 106 S.Ct. 2548. Accordingly, Amazon’s motion for summary judgment as to *1091 the fifth cause of action for false advertising is GRANTED.

Expected Question (From Dataset): Why did the court grant Amazon's motion for summary judgment?

LLM Generated "Question": Celotex, 477 U.S. at 324-25, 106 S.C. 2548.

MODEL BASELINES · FLAN-T5-XL

Validation loss: 1.13 · Question reliability: 11% · Precision: 69% · Recall: 69% · F1: 69%

Context (From Dataset): Answer: No, as of the time of the hearing, they had collected some but not all of the judgment. Context: As of the time of the hearing in this proceeding, Ms. Malova, Mr. Woodhams, and Ms. Prywes had collected some but not all of the judgment.

Expected Question (From Dataset): Did Ms. Malova, Mr. Woodhams, and Ms. Prywes collect the full judgment against Mr. Van Dusen?

LLM Generated "Question": Yes, they had collected most of the judgment.

Hyperparameters & Performance Optimizers

  • Epochs: 10
  • Warm-up steps: 5
  • Learning rate: varied
  • Weight decay: 0.01
  • Gradient clipping: 1
  • Eval strategy: epochs
  • Selection metric: loss
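These settings translate roughly into the trainer configuration sketched below, assuming the Hugging Face Seq2SeqTrainingArguments API was used; the output directory, batch size, and the particular learning rate shown are illustrative.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-qg",            # illustrative output path
    num_train_epochs=10,                # 10 epochs
    warmup_steps=5,                     # 5 warm-up steps
    learning_rate=2e-4,                 # varied across runs: 1e-5, 2e-4, 3e-4, 1e-3
    weight_decay=0.01,                  # weight decay
    max_grad_norm=1.0,                  # gradient clipping
    evaluation_strategy="epoch",        # evaluate at the end of every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # loss as the selection metric
    greater_is_better=False,
    per_device_train_batch_size=4,      # illustrative batch size
)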

MODEL VALIDATION

[Charts: validation loss vs. epoch for FLAN-T5-LARGE and FLAN-T5-XL at learning rates 1e-5, 2e-4, 3e-4, and 1e-3, with the BASE model shown for reference]

An overview of the model that generated the most consistent questions

FINAL VERDICT

BEST MODEL FULL METRICS

HUMAN EVALUATION OF PERFORMANCE & FINDINGS

05

MODEL EVALUATION & FINDINGS

MODEL EVALUATION · FLAN-T5-LARGE: EPOCH 3

Training loss: 0.8 · Validation loss: 0.8 · Question reliability: 100% · Precision: 87% · Recall: 94% · F1: 90%

Context (From Dataset): Input: Answer: Trademark infringement/dilution and false advertising Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).

Expected Question (From Dataset): What is the primary claim made by Apple against Amazon?

LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?

Here are some of ours:

Fool's Gold

Small datasets can mislead model bias-variance evaluation.

Split Decisions

The model is sensitive to the choice of train/test split.

Precision Postponed

Minor impact on small datasets; better reserved for final optimization efforts.

Ignorance (is not) Bliss

Higher rates improve robustness.

Training Trifecta

Context, question, answer integration essential in fine-tuning.

Less is More

Larger models need more data and risk overfitting.

"The power of data lies not in its volume, but in its interpretation."

tech report

RECOMMENDATIONS FOR IMPROVEMENTS

06

NEXT STEPS

Optimize questions for added complexity and sophistication by increasing the token limit and advancing the algorithms.

INCREASE COMPLEXITY

Add answering capability to the LLM for comprehensive, practical, end-to-end usage.

INCORPORATE ANSWERING

Evolve text-input tool to support various formats, batch processing, and LLM integration.

INCREASE TOOL VERSATILITY

Expand manual dataset for better performance, prioritizing data over increasing model size.

EXPAND THE DATASET

Revise the two-step fine-tuning with a legal dataset and manual labels, adding an intermediate knowledge step.

INTERMEDIARY FINE-TUNING

RECOMMENDATIONS FOR ADDED PROGRESS

Question Generation Tool

See how it performs question-generation tasks on text and cases outside of the dataset. Not shown is its ability to generate CSV files from the back end of the code, and its ability to manipulate Top-K and Top-P for tunable diversity of questions.

APEX

DEMONSTRATION
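A minimal sketch of the Top-K/Top-P controls mentioned above, assuming the fine-tuned FLAN-T5 checkpoint is loaded with transformers; the checkpoint path, input context, and parameter values are illustrative.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical path to the fine-tuned question-generation checkpoint.
model_dir = "flan-t5-qg"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

context = (
    "As of the time of the hearing in this proceeding, Ms. Malova, "
    "Mr. Woodhams, and Ms. Prywes had collected some but not all of the judgment."
)
inputs = tokenizer(context, return_tensors="pt", truncation=True)

# Top-K / Top-P sampling: larger values give more diverse questions.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,       # illustrative value
    top_p=0.95,     # illustrative value
    max_new_tokens=64,
    num_return_sequences=3,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))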

OPEN FLOOR FOR DISCUSSION

07

Questions & Answering


Justification

  • Original model size (without quantization):
    • Each parameter uses 32 bits, or 4 bytes.
    • For 3 billion parameters: 3B parameters × 4 bytes/parameter = 12B bytes.
    • 1 GB ≈ 1.074B bytes.
    • The original size is 12B bytes / 1.074B bytes/GB ≈ 11.18 GB, rounded to 12 GB.
  • Reduced model size with 4-bit quantization:
    • 8 bits = 1 byte (by definition), so 4 bits = 0.5 bytes.
    • Quantization reduces each parameter to 4 bits, or 0.5 bytes.
    • For 3 billion parameters: 3B parameters × 0.5 bytes/parameter = 1.5B bytes.
    • The quantized size is 1.5B bytes / 1.074B bytes/GB ≈ 1.4 GB, approximated to 1.5 GB.
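The same arithmetic as a quick sketch (a parameter count of roughly 3 billion is assumed, matching FLAN-T5-XL):

# Back-of-the-envelope model-size arithmetic from the justification above.
params = 3_000_000_000          # ~3B parameters (FLAN-T5-XL scale)
BYTES_PER_GB = 1.074e9          # bytes per GB as used above (2**30 ≈ 1.074e9)

fp32_bytes = params * 4         # 32-bit weights: 4 bytes per parameter
int4_bytes = params * 0.5       # 4-bit weights: 0.5 bytes per parameter

print(f"fp32:  {fp32_bytes / BYTES_PER_GB:.2f} GB")   # ≈ 11.17 GB, rounded to 12 GB
print(f"4-bit: {int4_bytes / BYTES_PER_GB:.2f} GB")   # ≈ 1.40 GB, approximated to 1.5 GB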