Capstone Final
Alex Marcia
Created on November 21, 2023
Transcript
Team APEX
Fine-tuning LLMs to auto-generate training datasets with question-answer pairs from legal documents
AGENDA
Project Network & Overview
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A
AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE
01
Project Network & Overview
Accure specializes in AI and ML platforms, providing data engineering and professional services. With more than 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.
ACCURE CUSTOMERS AND PARTNERS:
Meet our Partner
GPT
securegpt
Visualization
Data Warehousing
Deployment
insent
Impulse
Momentum
Accure offers an array of products and solutions with a proven track record across various industries
Team Overview
Members: Alex Marcia-Gonzalez, Henry Wu, Mitch Breeden, Sushmitha Tamilselvan, Jacob Baisden
Roles: Product Owner; Developer / Scrum Master for Sprints 1, 2, 3, and 5
While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets for the fine-tuning process, LLMs remain generic.
FORBES
"If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs"
LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.
expert knowledge needed
High costs due to expert involvement and extensive labor hours incurred.
financially demanding
Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.
TIME CONSUMING effort
These current fine-tuning practices remain far from ideal.
Traditional methods to create datasets that fine-tune LLMs involve manual data labeling
Minimal human oversight streamlines and simplifies LLM dataset development and eases time and financial burdens.
minimal oversight
Self-generated datasets facilitate automated, independent fine-tuning of LLMs.
self generated
Quality question-answer pairs for LLMs ensure an efficient output.
quality data Pairs
Team APEX ventures to address this challenge
There is a need for a streamlined approach to improve on the status quo
AN OVERVIEW OF PROCURED DATA AND ITS USE CASE
02
DATASETS
THE CASELAW ACCESS PROJECT
6.9M unique cases
The Caselaw Access Project includes cases dating as far back as the early 1600s. For our purposes, however, we focus on more recent cases (2013-2023).
Expansive
Harvard Law School compiled and owns the project. Despite this, the data remains freely available.
Harvard Owned
Open-source status indicates the license is free for public use and the dataset has no restrictions for researchers.
Open Source
Team APEX collected the data through an API (for large numbers of cases) and accessed individual cases through the project's website.
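A minimal sketch of how such a bulk pull could look, assuming the public Caselaw Access Project REST endpoint (api.case.law/v1/cases/) and its decision-date filters; the parameter names reflect the public CAP documentation as we recall it and should be treated as assumptions rather than our exact collection script.

```python
# Hypothetical sketch: pull full-text cases decided 2013-2023 from the
# Caselaw Access Project API (endpoint and parameter names are assumptions).
import requests

API_URL = "https://api.case.law/v1/cases/"
params = {
    "decision_date_min": "2013-01-01",
    "decision_date_max": "2023-12-31",
    "full_case": "true",      # include the full case body, not just metadata
    "page_size": 100,
}
headers = {"Authorization": "Token YOUR_CAP_API_KEY"}  # placeholder key

cases, url = [], API_URL
while url and len(cases) < 1000:          # cap the pull for illustration
    resp = requests.get(url, params=params if url == API_URL else None,
                        headers=headers, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    cases.extend(payload["results"])
    url = payload.get("next")             # the API paginates with a "next" URL
print(f"Collected {len(cases)} cases")
```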
data assessment
Reliable information
Only minor data conditioning was needed, thanks to the quality of the provider.
data labeling process
01 · Feed the corpus (court case) into a GPT along with general instructions
02 · Highlight the information you wish to create the questions for
03 · Receive the question-answer / answer-question pair
04 · Review, correct, and repeat as needed
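As a rough illustration of step 01, a prompt like the following could be sent to a GPT model through the OpenAI chat API; the model name, instruction wording, and helper function are hypothetical, not our exact labeling setup.

```python
# Hypothetical sketch of step 01: send a highlighted court-case passage to a
# GPT model with general instructions and get back a question-answer pair.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTIONS = (
    "You are helping build a legal QA dataset. Given a passage from a court "
    "case, write one question that the passage answers, then the answer. "
    "Return them as 'Question: ...' and 'Answer: ...'."
)

def generate_qa_pair(passage: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": passage},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

# Steps 02-04: a human highlights the passage, reviews the returned pair,
# corrects it if needed, and repeats for the next passage.
```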
FINE-TUNING DATASET
Result: 100 manually labeled question-answer pairs
80% Train · 18% Test · 2% Eval
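A minimal sketch of the 80/18/2 split over the 100 manually labeled pairs; the shuffling seed and the dictionary structure of each pair are illustrative assumptions.

```python
# Hypothetical sketch: split 100 manually labeled QA pairs 80/18/2.
import random

# Stand-ins for the 100 labeled pairs; in practice each entry would hold
# the context, question, and answer fields.
qa_pairs = [{"id": i} for i in range(100)]

random.seed(42)            # illustrative seed for reproducibility
random.shuffle(qa_pairs)

train = qa_pairs[:80]      # 80% -> training
test  = qa_pairs[80:98]    # 18% -> testing
eval_ = qa_pairs[98:]      # 2%  -> held-out evaluation
```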
ANALYSIS OF MODEL SELECTION PROCESS
03
MODEL SELECTION
Candidate models compared by parameter count:
110M vs 175B
110M vs 7B-180B
7B-70B vs 7B-180B
7B-70B vs 40B-180B
7B-70B vs ~1B-11B
We chose Flan-T5 Large & XL to train. To keep the frameworks manageable, LoRA is applied to these models as well.
Working with orders of magnitude
Selected Model
T5 ARCHITECTURE & LoRA
[Diagram: T5 encoder-decoder architecture (positional encoding; encoder and decoder blocks with self-attention, encoder/decoder attention, feed-forward, and add & normalize layers; a final linear layer and softmax) with LoRA weight matrices applied to the Q and V projections alongside the pretrained weights]
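A sketch of how LoRA adapters on the Q and V projections could be attached with the PEFT library, assuming the google/flan-t5-large checkpoint from Hugging Face; the rank, alpha, and dropout values are assumptions, not necessarily the values we used.

```python
# Hypothetical sketch: attach LoRA adapters to the Q and V projections of Flan-T5.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["q", "v"],   # T5 attention query/value projections
    r=16,                        # assumed rank
    lora_alpha=32,               # assumed scaling factor
    lora_dropout=0.05,           # assumed dropout
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # only the low-rank adapters are trained
```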
Resource Requirements
Given the quantization of the model, we found that resource efficiency can be achieved using a single NVIDIA A100 40GB GPU. This also leaves room for overhead.
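The deck does not specify the quantization scheme; as one hedged example, 8-bit loading via bitsandbytes would keep Flan-T5-XL comfortably within a single A100 40GB.

```python
# Hypothetical sketch: load Flan-T5-XL with 8-bit weights so it fits on one A100 40GB.
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)   # assumed 8-bit scheme

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    quantization_config=quant_config,
    device_map="auto",           # place layers on the available GPU
)
```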
APPROACHES TO TRAINING AND VALIDATION RESULTS
04
MODEL TRAINING & VALIDATION
Standard LLM Metrics are Unreliable for Question Generation
Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:
Hypothetical Scenario:
Context (from dataset): The law firm handled the case pro bono to support the community.
Expected Question (from dataset): Why did the law firm handle the case pro bono?
LLM Generated Question: Why did the law firm handle the case pro bono?
Estimated ROUGE scores:
ROUGE-1: 100% (perfect) · ROUGE-2: 100% (perfect) · ROUGE-L: 100% (perfect)
Standard LLM Metrics are Unreliable for Question Generation
Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:
Hypothetical Scenario:
Context (from dataset): The law firm handled the case pro bono to support the community.
Expected Question (from dataset): Why did the law firm handle the case pro bono?
LLM Generated Question: How did the law firm support the community?
Estimated ROUGE scores:
ROUGE-1: 50% (not better than chance) · ROUGE-2: 25% (poor) · ROUGE-L: 50% (not better than chance)
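The mismatch above can be reproduced with the rouge-score package; the exact values depend on tokenization, so treat the printed numbers as approximate rather than the figures on the slide.

```python
# Hypothetical sketch: ROUGE rewards the lexically identical question and
# penalizes the semantically reasonable paraphrase.
from rouge_score import rouge_scorer

expected   = "Why did the law firm handle the case pro bono?"
paraphrase = "How did the law firm support the community?"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

print(scorer.score(expected, expected))     # identical text -> perfect scores
print(scorer.score(expected, paraphrase))   # valid paraphrase -> much lower scores
```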
Enter BERTScore
This model-agnostic solution leverages interchangeable contextual embeddings from BERT transformers to focus on the meaning and context of outputs, as opposed to the words used and the order in which they appear. As a result, we are able to use the adjacent metrics:
Loss · Indicator of the LLM's prediction error during training
Question Reliability · The model's propensity toward consistent question generation
Precision · Ratio of true positives to all positive predictions
Recall · Ratio of true positives to all actual positives
F1 · Harmonic mean of precision and recall
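A minimal sketch with the bert-score package, which compares contextual embeddings rather than surface tokens; the underlying model it downloads by default is the library's choice, not something documented in this deck.

```python
# Hypothetical sketch: BERTScore rates the paraphrase on meaning, not word overlap.
from bert_score import score

references = ["Why did the law firm handle the case pro bono?"]
candidates = ["How did the law firm support the community?"]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"P={precision.item():.2f}  R={recall.item():.2f}  F1={f1.item():.2f}")
```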
MODEL BASELINES
FLAN-T5-LARGE
Validation Loss: 1.65
Context (from dataset): Answer: An absence of evidence to support Apple's case. Context: Because Apple is the non-moving party but will bear the burden of proof at trial on the false advertising claim, Amazon can prevail merely by pointing out to the court that there is an absence of evidence to support Apple’s case. Celotex, 477 U.S. at 324-25, 106 S.Ct. 2548. Accordingly, Amazon’s motion for summary judgment as to *1091 the fifth cause of action for false advertising is GRANTED.
Expected Question (from dataset): Why did the court grant Amazon's motion for summary judgment?
LLM Generated "Question": Celotex, 477 U.S. at 324-25, 106 S.C. 2548.
Question Reliability: 0% · Precision: 80% · Recall: 82% · F1: 81%
MODEL BASELINES
FLAN-T5-XL
Validation Loss: 1.13
Context (from dataset): Answer: No, as of the time of the hearing, they had collected some but not all of the judgment. Context: As of the time of the hearing in this proceeding, Ms. Malova, Mr. Woodhams, and Ms. Prywes had collected some but not all of the judgment.
Expected Question (from dataset): Did Ms. Malova, Mr. Woodhams, and Ms. Prywes collect the full judgment against Mr. Van Dusen?
LLM Generated "Question": Yes, they had collected most of the judgment.
Question Reliability: 11% · Precision: 69% · Recall: 69% · F1: 69%
Hyperparameters & Performance Optimizers
10 · Epochs
Varied · Learning Rate
0.01 · Weight Decay
05 · Warm-Up Steps
Epochs · Eval Strategy
Loss · Selection Metric
01 · Gradient Clipping
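A sketch of how these settings could map onto Hugging Face Seq2SeqTrainingArguments; the gradient-clipping norm and the single learning rate shown are assumptions (the deck varied the learning rate between 1e-5 and 1e-3).

```python
# Hypothetical sketch: the hyperparameters above expressed as Trainer settings.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-qgen",
    num_train_epochs=10,                # 10 epochs
    learning_rate=2e-4,                 # one of the varied rates (assumption)
    weight_decay=0.01,                  # weight decay of 0.01
    warmup_steps=5,                     # warm-up steps
    max_grad_norm=1.0,                  # gradient clipping (value assumed)
    evaluation_strategy="epoch",        # evaluate at the end of every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # loss as the selection metric
    greater_is_better=False,
)
```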
MODEL VALIDATION
[Plots: validation loss per epoch for the BASE, FLAN-T5-LARGE, and FLAN-T5-XL models at learning rates of 1e-5, 2e-4, 3e-4, and 1e-3]
An overview of the model that generated the most consistent questions
FINAL VERDICT
BEST MODEL FULL METRICS
HUMAN EVALUATION OF PERFORMANCE & FINDINGS
05
MODEL EVALUATION & FINDINGS
MODEL EVALUATION
FLAN-T5-LARGE: EPOCH 3
Training Loss: 0.8 · Validation Loss: 0.8
Context (from dataset): Input: Answer: Trademark infringement/dilution and false advertising. Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).
Expected Question (from dataset): What is the primary claim made by Apple against Amazon?
LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?
Question Reliability: 100% · Precision: 87% · Recall: 94% · F1: 90%
Here are some of ours:
Fool's Gold
Small datasets can mislead model bias-variance evaluation.
Split Decisions
Model sensitive to split choice.
Precision Postponed
Minor impact on small datasets, better for final optimization efforts.
Ignorance (is not) Bliss
Higher rates improve robustness.
Training Trifecta
Context, question, answer integration essential in fine-tuning.
Less is More
Larger models need more data, risk overfitting.
"The power of data lies not in its volume, but in its interpretation."
tech report
RECOMMENDATIONS FOR IMPROVEMENTS
06
NEXT STEPS
Optimize questions for added complexity and sophistication by increasing the token limit and advancing the algorithms.
INCREASE COMPLEXITY
Add answering capability to the LLM for comprehensive, practical, end-to-end usage.
INCORPORATE ANSWERING
Evolve text-input tool to support various formats, batch processing, and LLM integration.
INCREASE TOOL VERSATILITY
Expand the manually labeled dataset for better performance, prioritizing more data over increasing model size.
EXPAND THE DATASET
Revise the two-step fine-tuning with the legal dataset and manual labels, adding an intermediate knowledge step.
INTERMEDIARY FINE-TUNING
RECOMMENDATIONS FOR ADDED PROGRESS
Question Generation Tool
See how it performs question generation tasks with text and cases outside of the dataset. Not shown are its ability to generate CSV files from the back end of the code and its ability to manipulate Top-K and Top-P for malleable diversity of questions (a sketch of these sampling controls follows this slide).
APEX
DEMONSTRATION
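A hedged sketch of the Top-K / Top-P controls mentioned above, using the standard transformers generate API; the checkpoint path and the sampling values are placeholders, not the tool's exact settings.

```python
# Hypothetical sketch: generate questions with adjustable top-k / top-p sampling.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "path/to/fine-tuned-flan-t5"          # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

passage = "The law firm handled the case pro bono to support the community."
inputs = tokenizer(passage, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,            # adjustable for question diversity (value assumed)
    top_p=0.95,          # adjustable nucleus sampling (value assumed)
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```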
OPEN FLOOR FOR DISCUSSION
07
Questions & Answers