GP-GPT: Large Language Model for Gene-Phenotype Mapping
Authors: Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Xiaowei Yu, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu
Professor Yanfu Zhang, Department of Computer Science, University of William & Mary
Presented by: Matin Rouzehkhah Azad, Master's student in Data Science, University of Bergamo
Date: November 13, 2024
Presentation Overview
- Introduction & Motivation
- Objectives
- Training Process
- Model Architecture
- Key Results
- Conclusion
Introduction & Motivation
Overview of Gene-Phenotype Relationships
- Understanding gene-phenotype interactions is crucial for insights into genetic diseases and their underlying biological processes.
- Current research often focuses on individual genes or simple interactions, missing the broader complexity of multi-source genomic data.
Importance of GP-GPT in Genomics
- GP-GPT is designed as the first large language model specifically for mapping gene-phenotype relationships, aiming to unify diverse genomic data sources into a comprehensive knowledge representation.
Background & Objectives
What is the problem?
- Standard bioinformatics models struggle with multi-source data integration, complex gene-disease relationships, and precise knowledge extraction from unstructured data.
Objectives of GP-GPT
Develop a model that can:
- Accurately retrieve and map genetic information
- Determine relationships across multi-level bio-factors (genes, proteins, phenotypes)
- Outperform current state-of-the-art models in genetic information retrieval and relationship analysis
Training Process: Data and Methodologies Used
Data Collection
- Over 2.4 million gene, protein, and phenotype contexts
- Sources: OMIM, dbGaP, DisGeNET, UniProt, NCBI
Multi-Level, Multi-Task Corpus
- Gene-phenotype, gene-protein, protein-phenotype relationships
- Text data formatted into specific contexts for different tasks
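To make the corpus format concrete, here is a minimal sketch of how a single relation from a structured source might be serialized into a natural-language training context. The triple, field names, and template wording are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical serialization of one structured relation (e.g., from OMIM)
# into a natural-language training context. The template wording is an
# assumption for illustration; GP-GPT's actual templates may differ.
triple = ("FBN1", "associated_with", "Marfan syndrome")

gene, relation, phenotype = triple
context = f"The gene {gene} is {relation.replace('_', ' ')} the phenotype {phenotype}."
print(context)
# -> The gene FBN1 is associated with the phenotype Marfan syndrome.
```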
Two-Stage Fine-Tuning Approach
- Stage 1: Instruction mask prediction (masked text for entity recognition)
- Stage 2: Supervised fine-tuning with question-answer format
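The two stages can be illustrated with short prompt sketches. These templates are hypothetical reconstructions of the formats described above, not copied from the paper.

```python
# Stage 1: instruction mask prediction. A biomedical entity is masked out
# and the model learns to fill it in, which encourages entity recognition.
stage1_input = (
    "Complete the masked entity in the following statement.\n"
    "Mutations in the [MASK] gene cause Marfan syndrome."
)
stage1_target = "FBN1"

# Stage 2: supervised fine-tuning in question-answer format.
stage2_question = "Which gene is associated with Marfan syndrome?"
stage2_answer = "Marfan syndrome is associated with the FBN1 gene."
```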
Parameter-Efficient Techniques
- LoRA and QLoRA used for efficient tuning of LLaMA models
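A minimal sketch of how LoRA/QLoRA tuning of a LLaMA base model looks with the Hugging Face transformers and peft libraries; all hyperparameter values here are illustrative assumptions, not the settings reported in the paper.

```python
# QLoRA/LoRA sketch with Hugging Face transformers + peft. Hyperparameters
# are illustrative. Requires a CUDA GPU and the bitsandbytes package.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # base model; the paper also uses LLaMA 3
    quantization_config=bnb_config,
)

# LoRA: inject trainable low-rank adapters into the attention projections.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```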
Model Architecture: Key Components of GP-GPT
Foundation Model
- Based on the LLaMA family (LLaMA 2 and LLaMA 3)
Two-Stage Fine-Tuning Process
- LoRA and QLoRA for efficient parameter adaptation
- Minimal computational resources required for fine-tuning
Multi-Source Genomics Data Integration
- Data from OMIM, dbGaP, DisGeNET, UniProt
Multi-Level Bio-Factor Representation
- Gene, protein, and phenotype/disease entities
Tokenization & Embedding
- Character-based tokenization to handle gene IDs, protein names, and disease terms
- Embedding of bio-factors for improved relational mapping
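As a small illustration of why tokenization matters here: a stock LLaMA tokenizer splits rare biomedical identifiers into short, near character-level pieces, so arbitrary gene IDs can be represented without extending the vocabulary. The exact splits shown in the comments are indicative, not guaranteed.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Rare identifiers fall outside the subword vocabulary and break into
# short, near character-level pieces (actual splits vary by model).
print(tok.tokenize("FBN1"))    # e.g., ['▁F', 'BN', '1']
print(tok.tokenize("rs7412"))  # e.g., ['▁rs', '7', '4', '1', '2']
```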
Prompt-Based Structure
- Instruction and question-answer formats for flexible NLP tasks
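Putting the pieces together, a query in the question-answer format might be run as follows. The checkpoint name and prompt template are placeholders; the actual fine-tuned GP-GPT weights and template would be substituted where available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base checkpoint; the fine-tuned GP-GPT weights would go here.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    "### Question:\n"
    "Which phenotype is associated with the FBN1 gene?\n\n"
    "### Answer:\n"
)
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(output[0], skip_special_tokens=True))
```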
Key Results: Model Performance and Impact
Performance Metrics
- QA Accuracy: high scores in gene-disease association retrieval
- BLEU Scores: the highest BLEU and BLEU-1 scores among the compared LLMs
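For reference, BLEU-1 measures unigram overlap with the reference answer, while full BLEU averages 1- to 4-gram precision with a brevity penalty. A toy computation with NLTK (the sentences are made-up examples, not GP-GPT outputs):

```python
# Toy BLEU computation with NLTK; example sentences are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["Marfan syndrome is caused by mutations in the FBN1 gene".split()]
candidate = "Marfan syndrome is associated with the FBN1 gene".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
bleu4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU: {bleu4:.3f}")
```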
Model Comparisons
- Outperformed GPT-4, BioGPT, and LLaMA models in genetic information recall
- Superior accuracy in gene-phenotype relation determination tasks
Embedding Visualizations
- Clear clustering of gene and phenotype embeddings
- Improved entity representation and mapping of bio-factors
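The kind of visualization described can be sketched as follows: project entity embeddings to 2-D with t-SNE and inspect the clusters. Here random vectors stand in for hidden states that would actually be extracted from the fine-tuned model.

```python
# Embedding visualization sketch: 2-D t-SNE projection of gene and
# phenotype embeddings. Random placeholders stand in for model states.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
gene_emb = rng.normal(0.0, 1.0, size=(100, 512))   # placeholder gene embeddings
pheno_emb = rng.normal(3.0, 1.0, size=(100, 512))  # placeholder phenotype embeddings

points = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([gene_emb, pheno_emb])
)
plt.scatter(points[:100, 0], points[:100, 1], label="genes", s=10)
plt.scatter(points[100:, 0], points[100:, 1], label="phenotypes", s=10)
plt.legend()
plt.title("t-SNE of placeholder entity embeddings")
plt.show()
```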
Impact on Genomics Research
- Enhanced ability to map gene-disease relationships
- Potential for faster, more accurate insights in genetic disease research
Conclusion: Contributions and Future Directions
Contributions
- First Large Language Model for multi-level gene-phenotype mapping
- Improved Gene-Disease Mapping: Achieved high accuracy and efficiency in genomics tasks
- Enhanced Bio-Factor Representation: Effective embeddings for genes, proteins, and phenotypes
Future Directions
- Expand Dataset Sources: Incorporate more diverse bio-text data and genomic sequences
- Multi-Modality Integration: Add biological sequence data, imaging, and other data types
- Applications in Genetic Disease Prediction: Use GP-GPT for AI-assisted diagnostics and large-scale studies
Reference
Lyu, Y., Wu, Z., Zhang, L., Zhang, J., Li, Y., Ruan, W., et al. GP-GPT: Large Language Model for Gene-Phenotype Mapping. arXiv preprint arXiv:2409.09825v2, September 2024.
Thank you for your time and consideration