Capstone Final

alexmarciag

Created on November 21, 2023


Team APEX

apex of law

Fine-tuning LLMs to auto-generate training datasets of question-answer pairs from legal documents

AGENDA

Project Network & Overview
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A

01

Project Network & Overview

AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE

Meet our Partner

Deployment: Momentum

Data Warehousing: Impulse

Visualization: insent

GPT: securegpt

Accure offers an array of products and solutions with a proven track record across various industries.

Accure specializes in AI and ML platforms, providing data engineering and professional services. With over 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.

ACCURE CUSTOMERS AND PARTNERS:

Team Overview

Alex Marcia-Gonzalez · Developer / Sprint 2 Scrum Master

Jacob Baisden · Developer / Sprint 5 Scrum Master

Sushmitha Tamilselvan · Product Owner

Mitch Breeden · Developer / Sprint 3 Scrum Master

Henry Wu · Developer / Sprint 1 Scrum Master

“If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs.”

FORBES

While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets to use in the fine-tuning process, LLMs remain generic.

Traditional methods to create datasets that fine-tune LLMs involve manual data labeling

Time-consuming: Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.

Financially demanding: High costs are incurred due to expert involvement and extensive labor hours.

Expert knowledge needed: LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.

These current fine-tuning practices remain far from ideal.

There is a need for a streamlined approach to improve on the status quo.

Self-generated: Self-generated datasets facilitate automated, independent fine-tuning of LLMs.

Quality data pairs: Quality question-answer pairs ensure efficient LLM output.

Minimal oversight: Minimal human oversight streamlines and simplifies LLM dataset development, easing time and financial burdens.

Team APEX ventures to address this challenge

02

DATASETS

AN OVERVIEW OF PROCURED DATA AND ITS USE CASE

THE CASELAW ACCESS PROJECT: 6.9M cases

Open Source: Open-source status indicates the license is free for public use and the dataset has no restrictions for researchers.

Unique, Harvard Owned: Harvard Law School compiled and has ownership over the project. Despite this, the data remains freely available.

Expansive: The caselaw project dates back to the early 1600s. However, for our purposes we focus on more recent cases (2013-2023).

Reliable Information: Only minor data conditioning was needed, owing to the quality of the provider.

Team APEX collected the data through an API (for large numbers of cases) and accessed individual cases through the project's website.
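As a rough illustration, here is a minimal sketch of pulling cases in bulk from the Caselaw Access Project REST API. The endpoint is real, but the exact parameter names and response fields are assumptions to verify against the API documentation.

```python
# Hedged sketch of bulk case retrieval from the CAP v1 API.
import requests

BASE_URL = "https://api.case.law/v1/cases/"

def fetch_cases(min_date="2013-01-01", max_date="2023-12-31", pages=5):
    """Page through CAP search results, yielding full case bodies."""
    url = BASE_URL
    params = {
        "decision_date_min": min_date,   # assumption: date-range filters
        "decision_date_max": max_date,
        "full_case": "true",             # include full case text, not just metadata
        "page_size": 100,
    }
    for _ in range(pages):
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["results"]
        url, params = payload.get("next"), None  # follow cursor pagination
        if url is None:
            break

if __name__ == "__main__":
    for case in fetch_cases(pages=1):
        print(case["name_abbreviation"], case["decision_date"])
```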

DATA ASSESSMENT: THE DATA LABELING PROCESS

01: Highlight the information you wish to create the questions for.

02: Feed the corpus (court case) into a GPT along with general instructions (sketched below).

03: Receive the question-answer / answer-question pair.

04: Review, correct, and repeat as needed.
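A hedged sketch of step 02 using the OpenAI Python client; the model name, prompt wording, and response handling are illustrative assumptions rather than the team's exact setup.

```python
# Feed a court case plus general instructions to a chat model
# and get back one question-answer pair for manual review.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONS = (
    "You will be given an excerpt of a court case. Generate one question "
    "that the highlighted passage answers, then give the answer itself. "
    "Respond as 'Question: ...' and 'Answer: ...'."
)

def generate_qa_pair(case_excerpt: str, highlight: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": f"Case: {case_excerpt}\n\nHighlighted: {highlight}"},
        ],
    )
    return response.choices[0].message.content  # review, correct, repeat as needed
```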

RESULT: FINE-TUNING DATASET

100 manually labeled question-answer pairs, split 80% train, 18% test, 2% eval.
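A minimal sketch of the 80/18/2 split with scikit-learn; the list-of-dicts layout and seed are assumptions.

```python
# Split 100 labeled pairs into 80 train / 18 test / 2 eval.
from sklearn.model_selection import train_test_split

# qa_pairs: list of {"context": ..., "question": ..., "answer": ...} dicts
def split_dataset(qa_pairs, seed=42):
    train, rest = train_test_split(qa_pairs, test_size=0.20, random_state=seed)
    test, eval_ = train_test_split(rest, test_size=0.10, random_state=seed)
    return train, test, eval_  # 80 / 18 / 2 out of 100 pairs
```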

03

MODEL SELECTION

ANALYSIS OF MODEL SELECTION PROCESS

Candidate models compared by parameter count:

175B parameters vs. 110M parameters

7B-180B parameters vs. 110M parameters

7B-180B parameters vs. 7B-70B parameters

40B-180B parameters vs. 7B-70B parameters

~1B-11B parameters vs. 7B-70B parameters

Selected Model

Working with orders of magnitude

We chose Flan-T5 Large & XL to train. To make these frameworks manageable, LoRA is applied to the models as well.
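A minimal sketch, assuming Hugging Face PEFT, of attaching LoRA adapters to Flan-T5; the rank, alpha, and dropout values are illustrative, while the target modules mirror the Q & V projections called out in the diagram below.

```python
# Wrap a pretrained Flan-T5 with low-rank adapters on the attention
# query and value projections; only the adapters train.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # low-rank dimension (illustrative)
    lora_alpha=32,              # scaling factor on the adapter output
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5's query and value projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # base weights stay frozen
```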

T5 ARCHITECTURE & LoRA

[Diagram: T5 encoder-decoder stack (self-attention, encoder-decoder attention, feed-forward, and add & normalize layers in Encoders 1-2 and Decoders 1-2, with positional encoding on the inputs and a final linear + softmax head). LoRA injects trainable low-rank weight matrices alongside the frozen pretrained weights in the query (Q) and value (V) attention projections.]

SELECTED MODEL

Resource Requirements

Given the quantization of the model, we found that resource efficiency can be achieved using a single NVIDIA A100 40GB GPU. This also leaves room for overhead.
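A hedged sketch of loading the model with 4-bit quantization via bitsandbytes through Transformers, one way to stay within a single A100 40GB; the flag values are assumptions.

```python
# Load Flan-T5-XL with 4-bit weights (~0.5 bytes per parameter).
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight storage
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)
```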

04

MODEL TRAINING & VALIDATION

APPROACHES TO TRAINING AND VALIDATION RESULTS


Hypothetical Scenario:

Standard LLM Metrics are Unreliable for Question Generation

Pre-determining evaluation metrics is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:

Context (From Dataset): The law firm handled the case pro bono to support the community.

Expected Question (From Dataset): Why did the law firm handle the case pro bono?

LLM Generated Question: Why did the law firm handle the case pro bono?

Estimated ROUGE scores:

ROUGE-1: 100% (perfect)
ROUGE-2: 100% (perfect)
ROUGE-L: 100% (perfect)

Now the same context and expected question, with a generated question that is semantically reasonable but worded differently:

LLM Generated Question: How did the law firm support the community?

Estimated ROUGE scores:

ROUGE-1: 50% (no better than chance)
ROUGE-2: 25% (poor)
ROUGE-L: 50% (no better than chance)
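A sketch reproducing this example with Google's rouge-score package; exact scores depend on tokenization and stemming settings, so the slide's figures are estimates.

```python
# Show how a valid paraphrase is penalized by n-gram overlap metrics.
from rouge_score import rouge_scorer

expected = "Why did the law firm handle the case pro bono?"
generated = "How did the law firm support the community?"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(expected, generated).items():
    print(f"{name}: F1 = {result.fmeasure:.2f}")  # low despite a sensible question
```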

Enter BERTScore

Applied to the same hypothetical scenario, this model-agnostic solution leverages interchangeable contextual embeddings from BERT transformers to focus on the meaning and context of outputs, as opposed to the words used and the order they appear in. As a result, we are able to use the adjacent metrics:

Precision: Ratio of true positives to all positive predictions

Recall: Ratio of true positives to all actual positives

F1: Harmonic mean of precision and recall

Loss: Indicator of the LLM's prediction error during training

Question Reliability: The model's propensity toward consistent question generation
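A minimal sketch with the bert-score package, assuming its default English model: the paraphrased question from the example scores high on meaning despite its poor ROUGE estimates.

```python
# Embedding-based similarity instead of n-gram overlap.
from bert_score import score

candidates = ["How did the law firm support the community?"]
references = ["Why did the law firm handle the case pro bono?"]

P, R, F1 = score(candidates, references, lang="en")  # one tensor entry per pair
print(f"Precision={P[0]:.2f}  Recall={R[0]:.2f}  F1={F1[0]:.2f}")
```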

MODEL BASELINES: FLAN-T5-LARGE

Context (From Dataset): Answer: An absence of evidence to support Apple's case. Context: Because Apple is the non-moving party but will bear the burden of proof at trial on the false advertising claim, Amazon can prevail merely by pointing out to the court that there is an absence of evidence to support Apple’s case. Celotex, 477 U.S. at 324-25, 106 S.Ct. 2548. Accordingly, Amazon’s motion for summary judgment as to *1091 the fifth cause of action for false advertising is GRANTED.

Expected Question (From Dataset): Why did the court grant Amazon's motion for summary judgment?

LLM Generated "Question": Celotex, 477 U.S. at 324-25, 106 S.Ct. 2548.

Precision: 80% · Recall: 82% · F1: 81% · Question Reliability: 0% · Validation Loss: 1.65

MODEL BASELINES: FLAN-T5-XL

Context (From Dataset): Answer: No, as of the time of the hearing, they had collected some but not all of the judgment. Context: As of the time of the hearing in this proceeding, Ms. Malova, Mr. Woodhams, and Ms. Prywes had collected some but not all of the judgment.

Expected Question (From Dataset): Did Ms. Malova, Mr. Woodhams, and Ms. Prywes collect the full judgment against Mr. Van Dusen?

LLM Generated "Question": Yes, they had collected most of the judgment.

Precision: 69% · Recall: 69% · F1: 69% · Question Reliability: 11% · Validation Loss: 1.13

HYPERPARAMETERS & PERFORMANCE OPTIMIZERS

Eval Strategy: Epochs
Warm-Up Steps: 5
Learning Rate: Varied
Weight Decay: 0.01
Selection Metric: Loss
Epochs: 10
Gradient Clipping: 1.0
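A hedged sketch mapping these hyperparameters onto Hugging Face Seq2SeqTrainingArguments; the 3e-4 learning rate stands in for the varied values across runs, and output_dir is a placeholder.

```python
# Training configuration mirroring the listed hyperparameters.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-qgen",
    evaluation_strategy="epoch",    # Eval Strategy: Epochs
    save_strategy="epoch",          # must match eval strategy for best-model loading
    warmup_steps=5,                 # Warm-Up Steps: 5
    learning_rate=3e-4,             # Learning Rate: varied across runs
    weight_decay=0.01,              # Weight Decay: 0.01
    num_train_epochs=10,            # Epochs: 10
    max_grad_norm=1.0,              # Gradient Clipping: 1.0
    metric_for_best_model="loss",   # Selection Metric: loss
    load_best_model_at_end=True,
)
```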

MODEL VALIDATION

[Figure: Validation loss vs. epoch for FLAN-T5-LARGE and FLAN-T5-XL, comparing the base models against runs at learning rates 1e-3, 3e-4, 2e-4, and 1e-5.]

BEST MODEL FULL METRICS

FINAL VERDICT

An overview of the model that generated the most consistent questions.

05

MODEL EVALUATION & FINDINGS

HUMAN EVALUATION OF PERFORMANCE & FINDINGS

MODEL EVALUATION: FLAN-T5-LARGE, EPOCH 3

Context (From Dataset): Input: Answer: Trademark infringement/dilution and false advertising. Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).

Expected Question (From Dataset): What is the primary claim made by Apple against Amazon?

LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?

Precision: 87% · Recall: 94% · F1: 90% · Question Reliability: 100% · Training Loss: 0.8 · Validation Loss: 0.8

"The power of data lies not in its volume, but in its interpretation."

Here are some of ours:

Training Trifecta: Context, question, and answer integration is essential in fine-tuning.

Ignorance (is not) Bliss: Higher rates improve robustness.

Less is More: Larger models need more data and risk overfitting.

Fool's Gold: Small datasets can mislead bias-variance evaluation of the model.

Precision Postponed: Minor impact on small datasets; better reserved for final optimization efforts.

Split Decisions: The model is sensitive to the choice of data split.

tech report

06

NEXT STEPS

RECOMMENDATIONS FOR IMPROVEMENTS

INTERMEDIARY FINE-TUNING: Revise the two-step fine-tuning with the legal dataset and manual labels, adding an intermediate knowledge step.

EXPAND THE DATASET: Expand the manual dataset for better performance, prioritizing data over increased model size.

INCREASE TOOL VERSATILITY: Evolve the text-input tool to support various formats, batch processing, and LLM integration.

INCORPORATE ANSWERING: Add answering capability to the LLM for comprehensive, practical, end-to-end usage.

INCREASE COMPLEXITY: Optimize questions for added complexity and sophistication by increasing the token limit and advancing the algorithms.

RECOMMENDATIONS FOR FURTHER PROGRESS

APEX

Question Generation Tool

See how it performs question generation tasks with text and cases outside of the dataset. Not shown is its ability to generate CSV files from the back end and its ability to manipulate Top-K and Top-P for malleable diversity of questions.
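A hedged sketch of how the Top-K and Top-P knobs might map onto Hugging Face generation arguments for the fine-tuned model; the checkpoint path and prompt prefix are placeholders, not the tool's actual code.

```python
# Sampling-based question generation with adjustable diversity.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("./flan-t5-qgen")  # fine-tuned weights

def generate_question(context: str, top_k: int = 50, top_p: float = 0.95) -> str:
    inputs = tokenizer(f"generate question: {context}", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,   # sampling makes top_k / top_p take effect
        top_k=top_k,      # restrict sampling to the k most likely tokens
        top_p=top_p,      # nucleus sampling: smallest set with mass >= p
        max_new_tokens=64,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```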

DEMONSTRATION

07

Questions & Answers

OPEN FLOOR FOR DISCUSSION


Justification

Original model size (without quantization):
  • Each parameter uses 32 bits, or 4 bytes.
  • For 3 billion parameters: 3B parameters × 4 bytes/parameter = 12B bytes.
  • 1 GB ≈ 1.074B bytes.
  • The original size is 12B bytes ÷ 1.074B bytes/GB ≈ 11.18 GB, rounded to 12 GB.

Reduced model size with 4-bit quantization:
  • 8 bits = 1 byte by definition, so 4 bits = 0.5 bytes.
  • Quantization reduces each parameter to 4 bits, or 0.5 bytes.
  • For 3 billion parameters: 3B parameters × 0.5 bytes/parameter = 1.5B bytes.
  • The quantized size is 1.5B bytes ÷ 1.074B bytes/GB ≈ 1.4 GB, approximated to 1.5 GB.
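A tiny sketch of the same arithmetic, so the figures can be reproduced for any parameter count or bit width:

```python
# Memory footprint of model weights at a given precision.
def model_size_gb(n_params: float, bits_per_param: int) -> float:
    bytes_total = n_params * bits_per_param / 8  # bits -> bytes
    return bytes_total / 2**30                   # bytes -> GiB (~1.074B bytes)

print(model_size_gb(3e9, 32))  # ~11.18 GB unquantized
print(model_size_gb(3e9, 4))   # ~1.40 GB with 4-bit quantization
```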