GP_GPT

Matin Rouzehkhah Azad

Created on November 30, 2023


GP-GPT: Large Language Model for Gene-Phenotype Mapping

Authors: Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Xiaowei Yu, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu

Course instructor: Professor Yanfu Zhang, Department of Computer Science, University of William & Mary

Presented by: Matin Rouzehkhah Azad, Master's Student in Data Science, University of Bergamo

Date: November 13, 2024

Presentation Overview

  • Introduction & Motivation
  • Objectives
  • Training Process
  • Model Architecture
  • Key Results
  • Conclusion

Introduction & Motivation

Overview of Gene-Phenotype Relationships

  • Understanding gene-phenotype interactions is crucial for insights into genetic diseases and their underlying biological processes.
  • Current research often focuses on individual genes or simple interactions, missing the broader complexity of multi-source genomic data.
Importance of GP-GPT in Genomics
  • GP-GPT is designed as the first large language model specifically for mapping gene-phenotype relationships, aiming to unify diverse genomic data sources into a comprehensive knowledge representation.

Background & Objectives

What is the problem?

  • Standard bioinformatics models struggle with multi-source data integration, complex gene-disease relationships, and precise knowledge extraction from unstructured data.
Objectives of GP-GPT
Develop a model that can:
  • Accurately retrieve and map genetic information
  • Determine relationships across multi-level bio-factors (genes, proteins, phenotypes)
  • Outperform current state-of-the-art models in genetic information retrieval and relationship analysis

Training Process: Data and Methodologies Used

Data Collection

  • Over 2.4 million gene, protein, and phenotype contexts
  • Sources: OMIM, dbGaP, DisGeNET, UniProt, NCBI
Multi-Level, Multi-Task Corpus
  • Gene-phenotype, gene-protein, protein-phenotype relationships
  • Text data formatted into specific contexts for different tasks
Two-Stage Fine-Tuning Approach
  • Stage 1: Instruction mask prediction (masked text for entity recognition)
  • Stage 2: Supervised fine-tuning with question-answer format
Parameter-Efficient Techniques
  • LoRA and QLoRA used for efficient tuning of LLaMA models
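To make Stage 1 concrete, the idea of instruction mask prediction can be sketched as replacing known bio-entity mentions in a context with a mask token, so the model must learn to recover them. This is an illustrative sketch only: the entity list, the `[MASK]` token, and the `mask_entities` function are assumptions for illustration, not the paper's actual preprocessing pipeline.

```python
# Sketch of Stage-1 data preparation: replace known bio-entity mentions
# with a mask token so the model learns to predict them from context.
# The mask token and entity list here are illustrative assumptions.

def mask_entities(text: str, entities: list[str], mask: str = "[MASK]") -> str:
    """Replace each known entity mention in `text` with `mask`."""
    for entity in entities:
        text = text.replace(entity, mask)
    return text

context = "Mutations in BRCA1 are associated with hereditary breast cancer."
masked = mask_entities(context, ["BRCA1", "breast cancer"])
# masked == "Mutations in [MASK] are associated with hereditary [MASK]."
```

In Stage 2, pairs like (question, gold answer) would then be used for supervised fine-tuning on top of this masked-prediction pretraining.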

Model Architecture: Key Components of GP-GPT

Foundation Model

  • Based on the LLaMA family (LLaMA 2, 3)
Two-Stage Fine-Tuning Process
  • LoRA and QLoRA for efficient parameter adaptation
  • Minimal computational resources required for fine-tuning
Multi-Source Genomics Data Integration
  • Data from OMIM, dbGaP, DisGeNET, UniProt
Multi-Level Bio-Factor Representation
  • Gene, protein, and phenotype/disease entities
Tokenization & Embedding
  • Character-based tokenization to handle gene IDs, proteins, diseases
  • Embedding of bio-factors for improved relational mapping
Prompt-Based Structure
  • Instruction and question-answer formats for flexible NLP tasks
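The prompt-based structure above can be illustrated with a small template builder. The exact template wording and field names below are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative sketch of an instruction/question-answer prompt for a
# gene-phenotype relation query. Template wording is an assumption,
# not the paper's actual prompt format.

QA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Question:\nWhat is the relationship between {gene} and {phenotype}?\n\n"
    "### Answer:\n"
)

def build_prompt(gene: str, phenotype: str) -> str:
    """Fill the template for one gene-phenotype relation question."""
    return QA_TEMPLATE.format(
        instruction="Answer the question about gene-phenotype relationships.",
        gene=gene,
        phenotype=phenotype,
    )

prompt = build_prompt("CFTR", "cystic fibrosis")
```

The same template can serve both fine-tuning (with a gold answer appended) and inference (model completes after "### Answer:"), which is what makes the prompt-based structure flexible across NLP tasks.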

Key Results: Model Performance and Impact

Performance Metrics

  • QA Accuracy: High accuracy in gene-disease association retrieval
  • BLEU Scores: Achieved top BLEU and BLEU-1 scores compared to other LLMs
Model Comparisons
  • Outperformed GPT-4, BioGPT, and LLaMA models in genetic information recall
  • Superior accuracy in gene-phenotype relation determination tasks
Embedding Visualizations
  • Clear clustering of gene and phenotype embeddings
  • Improved entity representation and mapping of bio-factors
Impact on Genomics Research
  • Enhanced ability to map gene-disease relationships
  • Potential for faster, more accurate insights in genetic disease research
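For readers unfamiliar with the BLEU-1 metric reported above, it is essentially modified unigram precision: the fraction of candidate tokens that also appear in the reference, with counts clipped. A minimal sketch, simplified by omitting the brevity penalty used in full BLEU:

```python
# Simplified BLEU-1 (modified unigram precision, no brevity penalty):
# clipped candidate-token counts divided by candidate length.
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return clipped / len(cand) if cand else 0.0

score = bleu1(
    "BRCA1 is associated with breast cancer",
    "BRCA1 is strongly associated with breast cancer",
)
# all 6 candidate unigrams appear in the reference -> score == 1.0
```

Higher-order BLEU additionally averages clipped n-gram precisions and applies a brevity penalty for short candidates.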

Conclusion: Contributions and Future Directions

Contributions

  • First Large Language Model for multi-level gene-phenotype mapping
  • Improved Gene-Disease Mapping: Achieved high accuracy and efficiency in genomics tasks
  • Enhanced Bio-Factor Representation: Effective embeddings for genes, proteins, and phenotypes
Future Directions
  • Expand Dataset Sources: Incorporate more diverse bio-text data and genomic sequences
  • Multi-Modality Integration: Add biological sequence data, imaging, and other data types
  • Applications in Genetic Disease Prediction: Use GP-GPT for AI-assisted diagnostics and large-scale studies

Reference

Lyu, Y., Wu, Z., Zhang, L., et al. GP-GPT: Large Language Model for Gene-Phenotype Mapping. Department of Computer Science and Engineering, The University of Texas at Arlington. arXiv preprint arXiv:2409.09825v2, September 2024.

Thank you for your time and consideration