Capstone Final
alexmarciag
Created on November 21, 2023
Transcript
team
apex
of law
Fine-tuning LLMs to auto generate training datasets with question answer pairs from legal documents
Project Network & Overview
AGENDA
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A
01
Project Network & Overview
AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE
Meet our Partner
Deployment
Momentum
Data Warehousing
Impulse
Visualization
insent
GPT
securegpt
Accure offers an array of products and solutions with a proven track record across various industries
Accure specializes in AI and ML platforms, providing data engineering and professional services. With over 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.
ACCURE CUSTOMERS AND PARTNERS:
Team Overview
Alex Marcia-Gonzalez
Jacob Baisden
Sushmitha Tamilselvan
Mitch Breeden
Henry Wu
Developer / Sprint 2 Scrum Master
Developer / Sprint 5 Scrum Master
Product Owner
Developer / Sprint 3 Scrum Master
Developer / Sprint 1 Scrum Master
"If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs"
FORBES
While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets for the fine-tuning process, LLMs remain generic.
Traditional methods to create datasets that fine-tune LLMs involve manual data labeling
TIME CONSUMING effort
financially demanding
expert knowledge needed
High costs due to expert involvement and extensive labor hours incurred.
Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.
LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.
Current fine-tuning practices remain far from ideal.
A streamlined approach is needed to improve on the status quo
quality data Pairs
self generated
minimal oversight
Self-generated datasets facilitate automated, independent fine-tuning of LLMs.
High-quality question-answer pairs help ensure efficient, useful LLM output.
Minimal human oversight streamlines LLM dataset development and eases time and financial burdens.
Team APEX ventures to address this challenge
02
DATASETS
AN OVERVIEW OF PROCURED DATA AND ITS USE CASE
6.9M
THE CASELAW ACCESS PROJECT
Open Source
Open source status indicates the license is free for public use and the dataset has no restrictions for researchers
unique
Harvard Owned
Harvard Law School compiled and owns the project. Despite this, the data remains freely available
Team APEX collected the data through an API (for large number of cases) and accessed individual cases through the website above.
Expansive
The caselaw project dates as far back as the early 1600s. For our purposes, however, we focus on more recent cases (2013-2023)
cases
Reliable information
Only minor data conditioning was needed, given the quality of the provider
data assessment
data labeling process
01. Feed corpus (court case) into a GPT along with general instructions
02. Highlight the information you wish to create the questions for
03. Receive the Question-Answer / Answer-Question pair
04. Review, correct, and repeat as needed
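The labeling loop above can be sketched as prompt assembly; the instruction wording and function name below are illustrative assumptions, not the team's actual prompt:

```python
# Hypothetical sketch of the labeling loop's prompt assembly. The instruction
# text is an assumption for illustration, not the team's actual GPT prompt.
def build_labeling_prompt(case_text: str, highlighted_span: str) -> str:
    """Combine a court case and a highlighted span into a GPT instruction."""
    return (
        "You are assisting with legal dataset creation.\n"
        "Read the court case below and write one question-answer pair "
        "whose answer is the highlighted span.\n\n"
        f"Case:\n{case_text}\n\n"
        f"Highlighted span:\n{highlighted_span}\n"
    )

prompt = build_labeling_prompt(
    "Accordingly, Amazon's motion for summary judgment is GRANTED.",
    "GRANTED",
)
```

The GPT's response would then be reviewed and corrected before the pair enters the dataset, matching step 04 above.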
Result: a fine-tuning dataset of 100 manually labeled question-answer pairs, split 80% Train / 18% Test / 2% Eval.
03
MODEL SELECTION
ANALYSIS OF MODEL SELECTION PROCESS
175B parameters vs. 110M parameters
7B-180B parameters vs. 110M parameters
7B-180B parameters vs. 7B-70B parameters
40B-180B parameters vs. 7B-70B parameters
~1B-11B parameters vs. 7B-70B parameters
Selected Model
Working with orders of magnitude
We chose Flan-T5 Large & XL to train. To keep the frameworks manageable, LoRA is applied to these models as well
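The idea behind LoRA can be sketched in plain Python: the frozen weight W is augmented by a low-rank update (alpha / r) * B @ A, so only the small A and B matrices are trained. The shapes and values below are toy illustrations, not the actual T5 weights:

```python
# Minimal LoRA sketch: y = x @ (W + (alpha / r) * B @ A).
# W is frozen; only A (r x d_out) and B (d_in x r) would be trained.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the base weight plus the scaled low-rank update to one input row."""
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, same shape as W
    W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]
    return matmul([x], W_eff)[0]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (toy identity)
A = [[1.0, 1.0]]               # r x d_out, with rank r = 1
B = [[0.0], [0.0]]             # d_in x r, zero-initialized as in LoRA
y = lora_forward([2.0, 3.0], W, A, B, alpha=2.0, r=1)
# With B at its zero initialization the LoRA path is inactive, so y == x @ W
```

Zero-initializing B means training starts from the pretrained model's exact behavior, which is why LoRA fine-tuning is stable from step one.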
T5 ARCHITECTURE & LoRA
[Diagram: T5 encoder-decoder architecture (self-attention, encoder/decoder attention, add & normalize, feed-forward layers, positional encoding over the inputs) with LoRA adapters applied to the Q and V attention weights alongside the pretrained weights, followed by a linear layer and softmax.]
SELECTED MODEL
Resource Requirements
Given the quantization of the model, we found that resource efficiency can be achieved using a single NVIDIA A100 40GB GPU. This also leaves room for overhead
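A back-of-envelope check of why a single 40 GB A100 suffices. The parameter counts below are common published figures for Flan-T5 (roughly 3B for XL), used here as assumptions:

```python
# Back-of-envelope GPU memory estimate for model weights only.
# Assumes ~3e9 parameters for Flan-T5-XL (a published approximation) and
# 1 byte per parameter under 8-bit quantization vs. 4 bytes at fp32.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

xl_8bit = weight_memory_gb(3e9, 1)  # ~3 GB quantized
xl_fp32 = weight_memory_gb(3e9, 4)  # ~12 GB at full precision
# Either fits in 40 GB, but quantization leaves far more headroom for
# activations, optimizer state, and batch size during training.
```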
04
MODEL TRAINING & VALIDATION
APPROACHES TO TRAINING AND VALIDATION RESULTS
Hypothetical Scenario:
Standard LLM Metrics are Unreliable for Question Generation
Context (From Dataset): The law firm handled the case pro bono to support the community.
Expected Question (From Dataset):Why did the law firm handle the case pro bono?
LLM Generated Question: Why did the law firm handle the case pro bono?
Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:
Estimated ROUGE scores:
Rouge-1 estimate: 100% (perfect)
Rouge-2 estimate: 100% (perfect)
Rouge-L estimate: 100% (perfect)
Hypothetical Scenario:
Standard LLM Metrics are Unreliable for Question Generation
Context (From Dataset): The law firm handled the case pro bono to support the community.
Expected Question (From Dataset):Why did the law firm handle the case pro bono?
LLM Generated Question: How did the law firm support the community?
Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:
Estimated ROUGE scores:
Rouge-1 estimate: 50% (not better than chance)
Rouge-2 estimate: 25% (poor)
Rouge-L estimate: 50% (not better than chance)
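The failure mode can be reproduced with a toy ROUGE-1 F1 (unigram overlap only; not the official implementation): an exact copy scores perfectly, while an equally valid paraphrase scores low because it shares few surface words.

```python
from collections import Counter

# Toy ROUGE-1 F1: unigram overlap between reference and candidate.
def rouge1_f1(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().replace("?", "").split())
    cand = Counter(candidate.lower().replace("?", "").split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

expected = "Why did the law firm handle the case pro bono?"
exact = rouge1_f1(expected, "Why did the law firm handle the case pro bono?")
paraphrase = rouge1_f1(expected, "How did the law firm support the community?")
# exact == 1.0, while the valid paraphrase scores noticeably lower
```

Both candidate questions are faithful to the context, but only the verbatim copy is rewarded, which is exactly why we moved away from surface-overlap metrics.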
Hypothetical Scenario:
Standard LLM Metrics are Unreliable for Question Generation
Context (From Dataset): The law firm handled the case pro bono to support the community.
Expected Question (From Dataset): Why did the law firm handle the case pro bono?
LLM Generated Question: How does the law firm to support the community?
Enter BERTScores
This model-agnostic solution leverages interchangeable contextual embeddings from BERT transformers to focus on the meaning and context of outputs, rather than the words used and the order they appear in. As a result, we are able to use the adjacent metrics:
Precision: ratio of true positives to all positive predictions
Recall: ratio of true positives to all actual positives
F1: harmonic mean of precision and recall
Loss: indicator of the LLM's prediction error during training
Question Reliability: the model's propensity toward consistent question generation
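BERTScore's core mechanic, greedy soft matching of token embeddings, can be sketched with made-up 2-D vectors standing in for real BERT embeddings (the numbers are illustrative only):

```python
import math

# Sketch of BERTScore's greedy matching over token embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore(ref_vecs, cand_vecs):
    """Each candidate token greedily matches its most similar reference token
    (precision); recall is the mirror image; F1 is their harmonic mean."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return precision, recall, 2 * precision * recall / (precision + recall)

ref = [(1.0, 0.0), (0.0, 1.0)]       # toy "embeddings" for two reference tokens
p, r, f1 = bertscore(ref, ref)       # identical embeddings: all three equal 1.0
```

Because similarity is computed in embedding space, a paraphrase with different surface words can still score highly, which is the property ROUGE lacks.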
MODEL BASELINES: FLAN-T5-LARGE
Context (From Dataset): Answer: An absence of evidence to support Apple's case Context: Because Apple is the non-moving party but will bear the burden of proof at trial on the false advertising claim, Amazon can prevail merely by pointing out to the court that there is an absence of evidence to support Apple's case. Celotex, 477 U.S. at 324-25, 106 S.Ct. 2548. Accordingly, Amazon's motion for summary judgment as to *1091 the fifth cause of action for false advertising is GRANTED.
Expected Question (From Dataset): Why did the court grant Amazon's motion for summary judgment?
LLM Generated "Question": Celotex, 477 U.S. at 324-25, 106 S.C. 2548.
Precision: 80% · Recall: 82% · F1: 81% · Question Reliability: 0% · Validation Loss: 1.65
MODEL BASELINES: FLAN-T5-XL
Context (From Dataset): Answer: No, as of the time of the hearing, they had collected some but not all of the judgment. Context: As of the time of the hearing in this proceeding, Ms. Malova, Mr. Woodhams, and Ms. Prywes had collected some but not all of the judgment.
Expected Question (From Dataset): Did Ms. Malova, Mr. Woodhams, and Ms. Prywes collect the full judgment against Mr. Van Dusen?
LLM Generated "Question": Yes, they had collected most of the judgment.
Precision: 69% · Recall: 69% · F1: 69% · Question Reliability: 11% · Validation Loss: 1.13
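The slides do not define how question reliability is computed; one hypothetical proxy, sketched below, simply checks whether an output is shaped like a question, which would score the baseline outputs above (a bare citation, a yes/no statement) near zero:

```python
# Hypothetical question-reliability proxy. The team's actual metric is not
# specified on the slides; this heuristic is an illustrative assumption.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how",
                  "did", "does", "is", "was")

def looks_like_question(text: str) -> bool:
    t = text.strip().lower()
    return t.endswith("?") and t.split()[0] in QUESTION_WORDS

def question_reliability(outputs) -> float:
    """Share of generated outputs that are shaped like questions."""
    return sum(looks_like_question(o) for o in outputs) / len(outputs)

score = question_reliability([
    "Celotex, 477 U.S. at 324-25, 106 S.C. 2548.",    # citation, not a question
    "Yes, they had collected most of the judgment.",  # statement, not a question
    "Why did the court grant Amazon's motion?",       # well-formed question
])
# Only one of three outputs passes, so the score is 1/3
```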
Hyperparameters & Performance Optimizers
Eval Strategy: Epochs
Warm-Up Steps: 5
Learning Rate: Varied
Weight Decay: 0.01
Epochs: 10
Gradient Clipping: 1
Selection Metric: Loss
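If mapped onto Hugging Face Trainer-style argument names, the settings above might look roughly like the sketch below. This mapping is a hypothetical reading of the slide, and the learning rate is left unset because the slide lists it only as "Varied":

```python
# Hypothetical mapping of the slide's hyperparameters onto Hugging Face
# Trainer-style argument names; values are read off the slide.
training_config = {
    "evaluation_strategy": "epoch",   # Eval Strategy: Epochs
    "warmup_steps": 5,                # Warm-Up Steps: 5
    "learning_rate": None,            # varied across runs on the slide
    "weight_decay": 0.01,             # Weight Decay: 0.01
    "num_train_epochs": 10,           # Epochs: 10
    "max_grad_norm": 1.0,             # Gradient Clipping: 1
    "metric_for_best_model": "loss",  # Selection Metric: Loss
}
```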
MODEL VALIDATION
[Plots: validation loss vs. epoch for the base FLAN-T5-LARGE model (y-axis roughly 1e-3 to 3e-4) and the base FLAN-T5-XL model (y-axis roughly 2e-4 to 1e-5).]
BEST MODEL FULL METRICS
FINAL VERDICT
An overview of the model that generated the most consistent questions
05
MODEL EVALUATION & FINDINGS
HUMAN EVALUATION OF PERFORMANCE & FINDINGS
MODEL EVALUATION: FLAN-T5-LARGE, EPOCH 3
Context (From Dataset): Input: Answer: Trademark infringement/dilution and false advertising Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).
Expected Question (From Dataset): What is the primary claim made by Apple against Amazon?
LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?
Precision: 87% · Recall: 94% · F1: 90% · Question Reliability: 100% · Training Loss: 0.8 · Validation Loss: 0.8
"The power of data lies not in its volume, but in its interpretation."
Here are some of ours:
Training Trifecta: Context, question, and answer integration is essential in fine-tuning.
Ignorance (is not) Bliss: Higher rates improve robustness.
Less is More: Larger models need more data and risk overfitting.
Fool's Gold: Small datasets can mislead bias-variance evaluation of the model.
Precision Postponed: Minor impact on small datasets; better saved for final optimization efforts.
Split Decisions: The model is sensitive to the choice of split.
tech report
06
NEXT STEPS
RECOMMENDATIONS FOR IMPROVEMENTS
INTERMEDIARY FINE-TUNING: Revise the two-step fine-tuning with the legal dataset and manual labels, adding an intermediate knowledge step.
EXPAND THE DATASET: Expand the manual dataset for better performance, prioritizing more data over a larger model.
INCREASE TOOL VERSATILITY: Evolve the text-input tool to support various formats, batch processing, and LLM integration.
INCORPORATE ANSWERING: Add answering capability to the LLM for complete, practical end-to-end usage.
INCREASE COMPLEXITY: Optimize questions for added complexity and sophistication by increasing the token limit and advancing the algorithms.
RECOMMENDATIONS FOR ADDED PROGRESS
APEX
Question Generation Tool
See how it performs question generation tasks with text and cases outside of the dataset. Not shown are its ability to generate CSV files from the back end of the code and its ability to manipulate Top-K and Top-P for malleable diversity of questions
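The Top-K and Top-P knobs the tool exposes can be sketched over a toy next-token distribution (the probabilities below are invented for illustration):

```python
# Sketch of Top-K and Top-P (nucleus) filtering over a toy distribution.
def top_k(probs: dict, k: int) -> dict:
    """Keep only the k most probable tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def top_p(probs: dict, p: float) -> dict:
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

probs = {"why": 0.5, "what": 0.3, "how": 0.15, "when": 0.05}
# top_k(probs, 2) keeps "why" and "what"; top_p(probs, 0.9) keeps three tokens
```

Lower K or P narrows sampling to the likeliest question openers; raising them admits rarer tokens and so diversifies the generated questions.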
DEMONSTRATION
07
Questions & Answers
OPEN FLOOR FOR DISCUSSION