CAPSTONE
Alex Marcia
Created on November 18, 2023
of law
team
Fine-tuning LLMs to auto-generate training datasets of question-answer pairs from legal documents
apex
AGENDA
Project Network & Overview
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A
AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE
01
Project Network & Overview
Accure specializes in AI and ML platforms, providing data engineering and professional services. With more than 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.
ACCURE CUSTOMERS AND PARTNERS:
Meet our Partner
GPT
securegpt
Visualization
Data Warehousing
Deployment
insent
Impulse
Momentum
Accure offers an array of products and solutions with a proven track record across various industries
Team Overview
Developer: Alex Marcia-Gonzalez
Developer: Henry Wu
Developer: Mitch Breeden
Product Owner: Sushmitha Tamilselvan
Scrum Master: Jacob Baisden
While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets for the fine-tuning process, LLMs remain generic.
FORBES
"If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs"
Traditional methods of creating datasets to fine-tune LLMs involve manual data labeling. These fine-tuning practices remain far from ideal:
Expert knowledge needed: LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.
Financially demanding: Expert involvement and extensive labor hours drive up costs.
Time-consuming effort: Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.
There is a need for a streamlined approach that improves on the status quo. Team APEX ventures to address this challenge:
Minimal oversight: Minimal human oversight streamlines and simplifies LLM dataset development and eases time and financial burdens.
Self-generated: Self-generated datasets facilitate automated, independent fine-tuning of LLMs.
Quality data pairs: Quality question-answer pairs for LLMs ensure efficient output.
AN OVERVIEW OF PROCURED DATA AND ITS USE CASE
02
DATASETS
THE CASELAW ACCESS PROJECT
6.9M unique cases
Expansive: The Caselaw Access Project dates as far back as the early 1800s. For our purposes, however, we focus on more recent cases (2013-2023).
Harvard owned: Harvard Law School compiled the project and retains ownership of it. Despite this, the data remains freely available.
Open source: Open-source status indicates the license is free for public use and the dataset has no restrictions for researchers.
Data assessment: Team APEX collected the data through an API (for large numbers of cases) and accessed individual cases through the project website.
Reliable information: Only minor data conditioning was needed, owing to the quality of the provider.
DATA LABELING PROCESS
01 Highlight the information you wish to create the questions for
02 Feed the corpus (court case) into a GPT along with general instructions
03 Receive the question-answer/answer-question pair
04 Review, correct, and repeat as needed
100 question-answer pairs
80% used for training
18% used for testing
2% used for evaluation
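The 80/18/2 split above can be sketched in a few lines. This is a minimal illustration, not the team's actual pipeline: the function name `split_pairs`, the fixed seed, and the placeholder records are all assumptions.

```python
import random

def split_pairs(pairs, train=0.80, test=0.18, seed=0):
    """Shuffle QA pairs and split them into train / test / eval subsets.

    Whatever remains after the train and test shares becomes the eval subset.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )

# Placeholder records standing in for the 100 labeled question-answer pairs.
pairs = [{"id": i} for i in range(100)]
train_set, test_set, eval_set = split_pairs(pairs)
print(len(train_set), len(test_set), len(eval_set))  # 80 18 2
```

With only two evaluation examples, results on that subset will swing heavily with the choice of seed, which is worth keeping in mind when reading the metrics.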
RESULTING DATASET
[Table: item #, field, type, description]
ANALYSIS OF MODEL SELECTION PROCESS
03
MODEL SELECTION
[Figure: pairwise comparison of candidate models by parameter count]
110M parameters vs. 175B parameters
110M parameters vs. 40-180B parameters
7B-70B parameters vs. 40-180B parameters
7B-70B parameters vs. ~1B-11B parameters
We chose to train FLAN-T5 Large and XL. To keep the models manageable, LoRA is applied to these models as well.
Working with orders of magnitude
Selected Model
T5 ARCHITECTURE & LoRA
[Figure: T5 encoder-decoder architecture. Inputs pass through positional encoding into two encoder blocks (self-attention, add & normalize, feed forward, add & normalize) and two decoder blocks (self-attention, add & normalize, encoder-decoder attention, add & normalize, feed forward, add & normalize), followed by linear and softmax layers. LoRA adapters are applied to the Q and V projections alongside the pretrained weights.]
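To give a sense of why LoRA on the Q and V projections keeps training manageable, the sketch below counts the parameters the adapters add. The slides do not state the LoRA rank or the model dimensions, so the numbers used here are assumptions chosen to resemble a FLAN-T5-Large-sized model: hidden size 1024, 72 attention blocks (24 encoder self-attention, 24 decoder self-attention, 24 encoder-decoder attention), and a hypothetical rank of 16. Check them against the real model config before relying on them.

```python
def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The pretrained weight W stays fixed; the update is the product B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank

def lora_trainable_params(d_model: int, n_attention_blocks: int,
                          rank: int, targets_per_block: int = 2) -> int:
    """Total adapter parameters when Q and V (2 targets) are adapted per block."""
    per_matrix = lora_params_per_matrix(d_model, d_model, rank)
    return n_attention_blocks * targets_per_block * per_matrix

# Assumed values (not taken from the slides): see the lead-in above.
D_MODEL = 1024
N_ATTENTION_BLOCKS = 72
RANK = 16

adapter_total = lora_trainable_params(D_MODEL, N_ATTENTION_BLOCKS, RANK)
full_matrix = D_MODEL * D_MODEL
print(adapter_total)                                        # 4718592
print(lora_params_per_matrix(D_MODEL, D_MODEL, RANK) / full_matrix)  # 0.03125
```

Under these assumptions each adapted matrix trains about 3% of the parameters of the full projection it modifies, and the adapters total under five million parameters.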
Resource Requirements
Given the quantization of the model, we found that resource efficiency can be achieved using a single NVIDIA A100 40GB GPU. This also leaves room for overhead.
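A rough back-of-the-envelope estimate shows how quantization affects the memory needed just to hold the weights. The slides do not state the bit width used or the exact parameter count, so the 3-billion-parameter figure (roughly the size of an XL-class model) and the bit widths below are assumptions for illustration only. The estimate ignores activations, gradients, and optimizer state, which add to the total during training.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to store model weights only."""
    return n_params * bits_per_param / 8 / 1e9

ASSUMED_PARAMS = 3e9  # hypothetical parameter count, not taken from the slides

for bits in (32, 16, 8):
    print(f"{bits}-bit weights: {weight_memory_gb(ASSUMED_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone fit comfortably in 40 GB at any of the listed precisions; the remaining headroom is what training-time state has to share.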
APPROACHES TO TRAINING AND VALIDATION RESULTS
04
MODEL TRAINING & VALIDATION
Standard LLM Metrics Are Unreliable for Question Generation
Predetermining the metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE.
Hypothetical Scenario:
Expected Question (From Dataset): Why did the law firm handle the case pro bono?
Context (From Dataset): The law firm handled the case pro bono to support the community.
LLM Generated Question: Why did the law firm handle the case pro bono?
Estimated ROUGE scores:
ROUGE-1: 100% (perfect)
ROUGE-2: 100% (perfect)
ROUGE-L: 100% (perfect)
LLM Generated Question: How did the law firm support the community?
Estimated ROUGE scores:
ROUGE-1: 50% (no better than chance)
ROUGE-2: 25% (poor)
ROUGE-L: 50% (no better than chance)
This question is also valid for the context, yet ROUGE scores it poorly because it shares few words with the expected question.
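The effect in the hypothetical scenario can be reproduced with a small, dependency-free sketch of recall-oriented ROUGE. This is a simplified illustration rather than the scoring package the team used: the lowercase alphanumeric tokenizer and the choice to report recall only are assumptions, and exact figures depend on tokenization and on whether recall, precision, or F1 is reported, so they need not match the slide's estimates.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """Share of the reference's n-grams that also appear in the candidate."""
    ref, cand = ngrams(tokens(reference), n), ngrams(tokens(candidate), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(reference, candidate):
    ref, cand = tokens(reference), tokens(candidate)
    return lcs_length(ref, cand) / max(len(ref), 1)

expected = "Why did the law firm handle the case pro bono?"
generated = "How did the law firm support the community?"

for name, score in [
    ("ROUGE-1", rouge_n_recall(expected, generated, 1)),
    ("ROUGE-2", rouge_n_recall(expected, generated, 2)),
    ("ROUGE-L", rouge_l_recall(expected, generated)),
]:
    print(f"{name} recall: {score:.2f}")
```

With this tokenization, the paraphrased question recovers only half of the expected question's unigrams and a third of its bigrams, even though it is a sensible question about the same context.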
Enter BERTScores
This model-agnostic solution leverages interchangeable contextual embeddings from BERT transformers to focus on the meaning and context of outputs, as opposed to the words used and the order in which they appear. As a result, we are able to use the following metrics:
Loss: Indicator of the LLM's prediction error during training.
Question reliability: The model's propensity toward consistent question generation.
Precision: Ratio of true positives to all positive predictions.
Recall: Ratio of true positives to all actual positives.
F1: Harmonic mean of precision and recall.
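In BERTScore, the "true positives" are soft: each token embedding is matched to its most similar counterpart by cosine similarity, and the similarities are averaged. The sketch below shows that greedy matching in plain Python. The 2-D vectors are toy stand-ins for real BERT contextual embeddings, and the function names are hypothetical; it illustrates the idea and does not replace an actual BERTScore implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore(ref_vecs, cand_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    Recall: each reference token takes its best match in the candidate.
    Precision: each candidate token takes its best match in the reference.
    """
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy embeddings (illustrative only, not produced by BERT).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = bertscore(reference, candidate)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the comparison happens in embedding space, two differently worded questions with similar meaning can still score highly, which is the property the ROUGE example lacks.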
MODEL BASELINES
FLAN-T5-LARGE
Validation loss: 1.65
Precision: 80%
Recall: 82%
F1: 81%
Question reliability: 0%
Context (From Dataset): Input: answer: Mr. Van Dusen was diagnosed with Voyeuristic Disorder and Major Depressive Disorder — Mild by Dr. Jeffrey S. Janofsky. context: During the hearing in this matter, the Commission offered the testimony of psychiatrist Jeffrey S. Janofsky, MD, who was accepted by the hearing judge as an expert. According to Dr. Janofsky, at the time of the surreptitious videotaping, Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild.
Expected Question (From Dataset): What were the psychiatric diagnoses of Mr. Van Dusen during the surreptitious videotaping period?
LLM Generated Question: Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild
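The context line above shows the model input as an answer followed by its context, with the expected question as the target. A minimal sketch of assembling one such training example follows. The helper names and the dictionary keys `input` and `target` are hypothetical, and the lowercase `answer:` / `context:` prefixes simply mirror the example shown on the slide.

```python
def build_input(answer: str, context: str) -> str:
    """Format one model input as 'answer: ... context: ...'."""
    return f"answer: {answer.strip()} context: {context.strip()}"

def build_example(answer: str, context: str, question: str) -> dict:
    """Pair the formatted input with the question the model should generate."""
    return {"input": build_input(answer, context), "target": question.strip()}

example = build_example(
    answer="The law firm handled the case pro bono to support the community.",
    context="The law firm handled the case pro bono to support the community.",
    question="Why did the law firm handle the case pro bono?",
)
print(example["input"])
print(example["target"])
```

Keeping the prefixes identical between training and inference matters, since the model learns to rely on them to tell the answer apart from the context.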
MODEL BASELINES
FLAN-T5-XL
Validation loss: 1.13
Precision: 69%
Recall: 69%
F1: 69%
Question reliability: 11%
Context (From Dataset): Input: answer: Mr. Van Dusen was diagnosed with Voyeuristic Disorder and Major Depressive Disorder — Mild by Dr. Jeffrey S. Janofsky. context: During the hearing in this matter, the Commission offered the testimony of psychiatrist Jeffrey S. Janofsky, MD, who was accepted by the hearing judge as an expert. According to Dr. Janofsky, at the time of the surreptitious videotaping, Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild.
Expected Question (From Dataset): What were the psychiatric diagnoses of Mr. Van Dusen during the surreptitious videotaping period?
LLM Generated Question: Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild
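The baseline output above is a statement rather than a question, which is what the question reliability metric is meant to expose. The slides do not say how the metric is computed, so the sketch below is purely an assumed reading: the share of generated outputs that end with a question mark. Both function names are hypothetical.

```python
def looks_like_question(text: str) -> bool:
    """Assumed check: an output counts as a question if it ends with '?'."""
    return text.strip().endswith("?")

def question_reliability(outputs) -> float:
    """Fraction of generated outputs that pass the question check."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(looks_like_question(o) for o in outputs) / len(outputs)

generated = [
    "Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder",
    "What type of lawsuit is being filed by Apple vs. Amazon.com?",
]
print(question_reliability(generated))  # 0.5
```

A stricter variant could also require an interrogative opening word, at the cost of rejecting some valid questions.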
Hyperparameters and Performance Optimizers
Epochs: 10
Warm-up steps: 05
Learning rate: varied
Weight decay: .01
Eval strategy: epochs
Selection metric: loss
Gradient clipping: 01
MODEL VALIDATION
[Figure: validation loss by epoch for FLAN-T5-LARGE and FLAN-T5-XL, comparing the base model with learning rates 1e-5, 2e-4, 3e-4, and 1e-3.]
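The sweep described above (two models, four learning rates, ten epochs, evaluation every epoch, selection by loss) can be sketched as a small grid plus a checkpoint picker. This is an outline under stated assumptions, not the team's training script: the function names are hypothetical, and the actual training loop that would produce the per-epoch validation losses is left out.

```python
from itertools import product

MODELS = ["FLAN-T5-LARGE", "FLAN-T5-XL"]
LEARNING_RATES = [1e-5, 2e-4, 3e-4, 1e-3]
EPOCHS = 10

def sweep_grid():
    """One run configuration per (model, learning rate) combination."""
    return [
        {"model": model, "learning_rate": lr, "epochs": EPOCHS}
        for model, lr in product(MODELS, LEARNING_RATES)
    ]

def best_checkpoint(val_losses):
    """Return (epoch, loss) for the lowest validation loss; epochs are 1-based.

    Expects one validation loss per epoch, matching the per-epoch eval strategy.
    """
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]

for run in sweep_grid():
    print(run)
```

Calling `best_checkpoint` on a run's list of per-epoch validation losses returns the epoch to keep, which is how a result such as "epoch 3" would be selected when loss is the selection metric.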
FINAL VERDICT
BEST MODEL FULL METRICS
An overview of the model that generated the most consistent questions
HUMAN EVALUATION OF PERFORMANCE & FINDINGS
05
MODEL EVALUATION & FINDINGS
MODEL EVALUATION
FLAN-T5-LARGE: EPOCH 3
Training loss: 0.8
Validation loss: 0.8
Precision: 87%
Recall: 94%
F1: 90%
Question reliability: 100%
Context (From Dataset): Input: Answer: Trademark infringement/dilution and false advertising Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).
Expected Question (From Dataset): What is the primary claim made by Apple Against Amazon?
LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?
Performance on outside instances (DEMO)
Here are some of ours:
Fool's Gold: Small datasets can mislead model bias-variance evaluation.
Split Decisions: The model is sensitive to the choice of split.
Precision Postponed: Minor impact on small datasets; better for final optimization efforts.
Ignorance (is not) Bliss: Higher rates improve robustness.
Training Trifecta: Integrating context, question, and answer is essential in fine-tuning.
Less is More: Larger models need more data and risk overfitting.
"The power of data lies not in its volume, but in its interpretation."
tech report
RECOMMENDATIONS FOR IMPROVEMENTS
06
NEXT STEPS
RECOMMENDATIONS FOR ADDED PROGRESS
INCREASE COMPLEXITY: Optimize questions for added complexity and sophistication by increasing the token limit and advancing the algorithms.
INCORPORATE ANSWERING: Add answering capability to the LLM for comprehensive, practical, end-to-end use.
INCREASE TOOL VERSATILITY: Evolve the text-input tool to support various formats, batch processing, and LLM integration.
EXPAND THE DATASET: Expand the manual dataset for better performance, prioritizing data over increasing model size.
INTERMEDIARY FINE-TUNING: Revise the two-step fine-tuning with the legal dataset and manual labels, adding an intermediate knowledge step.
OPEN FLOOR FOR DISCUSSION
07
Questions & Answers
thanks!