GP-GPT: Large Language Model for Gene-Phenotype Mapping
Authors: Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Xiaowei Yu, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu
Professor Yanfu Zhang, Department of Computer Science, University of William & Mary
Presented by: Matin Rouzehkhah Azad, Master's student in Data Science, University of Bergamo
Date: November 13, 2024
Presentation Overview
- Introduction & Motivation
- Objectives
- Training Process
- Model Architecture
- Key Results
- Conclusion
Introduction & Motivation
Overview of Gene-Phenotype Relationships
- Understanding gene-phenotype interactions is crucial for insights into genetic diseases and their underlying biological processes.
- Current research often focuses on individual genes or simple interactions, missing the broader complexity of multi-source genomic data.
Importance of GP-GPT in Genomics
- GP-GPT is designed as the first large language model specifically for mapping gene-phenotype relationships, aiming to unify diverse genomic data sources into a comprehensive knowledge representation.
Background & Objectives
What is the problem?
- Standard bioinformatics models struggle with multi-source data integration, complex gene-disease relationships, and precise knowledge extraction from unstructured data.
Objectives of GP-GPT
Develop a model that can:
- Accurately retrieve and map genetic information
- Determine relationships across multi-level bio-factors (genes, proteins, phenotypes)
- Outperform current state-of-the-art models in genetic information retrieval and relationship analysis
Training Process: Data and Methodologies Used
Data Collection
- Over 2.4 million gene, protein, and phenotype contexts
- Sources: OMIM, dbGaP, DisGeNET, UniProt, NCBI
Multi-Level, Multi-Task Corpus
- Gene-phenotype, gene-protein, protein-phenotype relationships
- Text data formatted into specific contexts for different tasks
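To make the corpus format concrete, here is a minimal sketch of how a single relation from a structured source might be serialized into a natural-language training context. The triple, field names, and template wording are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical serialization of one structured relation (e.g., from OMIM)
# into a natural-language training context. The template wording is an
# assumption for illustration; GP-GPT's actual templates may differ.
triple = ("FBN1", "associated_with", "Marfan syndrome")

gene, relation, phenotype = triple
context = f"The gene {gene} is {relation.replace('_', ' ')} the phenotype {phenotype}."
print(context)
# -> The gene FBN1 is associated with the phenotype Marfan syndrome.
```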
Two-Stage Fine-Tuning Approach
- Stage 1: Instruction mask prediction (masked text for entity recognition)
- Stage 2: Supervised fine-tuning with question-answer format
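The two stages can be illustrated with short prompt sketches. These templates are hypothetical reconstructions of the formats described above, not copied from the paper.

```python
# Stage 1: instruction mask prediction. A biomedical entity is masked out
# and the model learns to fill it in, which encourages entity recognition.
stage1_input = (
    "Complete the masked entity in the following statement.\n"
    "Mutations in the [MASK] gene cause Marfan syndrome."
)
stage1_target = "FBN1"

# Stage 2: supervised fine-tuning in question-answer format.
stage2_question = "Which gene is associated with Marfan syndrome?"
stage2_answer = "Marfan syndrome is associated with the FBN1 gene."
```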
Parameter-Efficient Techniques
- LoRA and QLoRA used for efficient tuning of LLaMA models
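A minimal sketch of how LoRA/QLoRA tuning of a LLaMA base model looks with the Hugging Face transformers and peft libraries; all hyperparameter values here are illustrative assumptions, not the settings reported in the paper.

```python
# QLoRA/LoRA sketch with Hugging Face transformers + peft. Hyperparameters
# are illustrative. Requires a CUDA GPU and the bitsandbytes package.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # base model; the paper also uses LLaMA 3
    quantization_config=bnb_config,
)

# LoRA: inject trainable low-rank adapters into the attention projections.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```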
Model Architecture: Key Components of GP-GPT
Foundation Model
- Based on the LLaMA family (LLaMA 2 and LLaMA 3)
Two-Stage Fine-Tuning Process
- LoRA and QLoRA for efficient parameter adaptation
- Minimal computational resources required for fine-tuning
Multi-Source Genomics Data Integration
- Data from OMIM, dbGaP, DisGeNET, UniProt
Multi-Level Bio-Factor Representation
- Gene, protein, and phenotype/disease entities
Tokenization & Embedding
- Character-based tokenization to handle gene IDs, protein names, and disease terms
- Embedding of bio-factors for improved relational mapping
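As a small illustration of why tokenization matters here: a stock LLaMA tokenizer splits rare biomedical identifiers into short, near character-level pieces, so arbitrary gene IDs can be represented without extending the vocabulary. The exact splits shown in the comments are indicative, not guaranteed.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Rare identifiers fall outside the subword vocabulary and break into
# short, near character-level pieces (actual splits vary by model).
print(tok.tokenize("FBN1"))    # e.g., ['▁F', 'BN', '1']
print(tok.tokenize("rs7412"))  # e.g., ['▁rs', '7', '4', '1', '2']
```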
Prompt-Based Structure
- Instruction and question-answer formats for flexible NLP tasks
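Putting the pieces together, a query in the question-answer format might be run as follows. The checkpoint name and prompt template are placeholders; the actual fine-tuned GP-GPT weights and template would be substituted where available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base checkpoint; the fine-tuned GP-GPT weights would go here.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    "### Question:\n"
    "Which phenotype is associated with the FBN1 gene?\n\n"
    "### Answer:\n"
)
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(output[0], skip_special_tokens=True))
```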
Key Results: Model Performance and Impact
Performance Metrics
- QA Accuracy: high scores in gene-disease association retrieval
- BLEU Scores: the highest BLEU and BLEU-1 scores among the compared LLMs
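For reference, BLEU-1 measures unigram overlap with the reference answer, while full BLEU averages 1- to 4-gram precision with a brevity penalty. A toy computation with NLTK (the sentences are made-up examples, not GP-GPT outputs):

```python
# Toy BLEU computation with NLTK; example sentences are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["Marfan syndrome is caused by mutations in the FBN1 gene".split()]
candidate = "Marfan syndrome is associated with the FBN1 gene".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
bleu4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU: {bleu4:.3f}")
```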
Model Comparisons
- Outperformed GPT-4, BioGPT, and LLaMA models in genetic information recall
- Superior accuracy in gene-phenotype relation determination tasks
Embedding Visualizations
- Clear clustering of gene and phenotype embeddings
- Improved entity representation and mapping of bio-factors
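The kind of visualization described can be sketched as follows: project entity embeddings to 2-D with t-SNE and inspect the clusters. Here random vectors stand in for hidden states that would actually be extracted from the fine-tuned model.

```python
# Embedding visualization sketch: 2-D t-SNE projection of gene and
# phenotype embeddings. Random placeholders stand in for model states.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
gene_emb = rng.normal(0.0, 1.0, size=(100, 512))   # placeholder gene embeddings
pheno_emb = rng.normal(3.0, 1.0, size=(100, 512))  # placeholder phenotype embeddings

points = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([gene_emb, pheno_emb])
)
plt.scatter(points[:100, 0], points[:100, 1], label="genes", s=10)
plt.scatter(points[100:, 0], points[100:, 1], label="phenotypes", s=10)
plt.legend()
plt.title("t-SNE of placeholder entity embeddings")
plt.show()
```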
Impact on Genomics Research
- Enhanced ability to map gene-disease relationships
- Potential for faster, more accurate insights in genetic disease research
Conclusion: Contributions and Future Directions
Contributions
- First Large Language Model for multi-level gene-phenotype mapping
- Improved Gene-Disease Mapping: Achieved high accuracy and efficiency in genomics tasks
- Enhanced Bio-Factor Representation: Effective embeddings for genes, proteins, and phenotypes
Future Directions
- Expand Dataset Sources: Incorporate more diverse bio-text data and genomic sequences
- Multi-Modality Integration: Add biological sequence data, imaging, and other data types
- Applications in Genetic Disease Prediction: Use GP-GPT for AI-assisted diagnostics and large-scale studies
Reference
Lyu, Y., Wu, Z., Zhang, L., Zhang, J., Li, Y., Ruan, W., et al. GP-GPT: Large Language Model for Gene-Phenotype Mapping. arXiv preprint arXiv:2409.09825v2, September 2024.
Thank you for your time and consideration