OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling

Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Lars Schimmelpfennig, Levi Kaster, Di Huang, Carlos Cruchaga, Guangfu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li · April 02, 2025

Summary

OmniCellTOSG is a dataset of cell text-omic signaling graphs built for joint LLM and GNN modeling. It combines human annotations with numeric graph values and supports research into normal development and disease. The source data come from the Gene Expression Omnibus (GEO); duplicates are excluded, and the data are converted to H5AD or H5 format for compatibility with CellTypist. The dataset is structured for use with PyTorch to support advanced model development, and it is intended to ease the management of large genomic data in the life sciences, healthcare, and related fields.
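To make the idea of pairing human annotations with numeric graph values concrete, here is a minimal sketch in plain Python. It is not the dataset's actual schema or loader: the `TextOmicGraph` class, its field names, the three-gene toy pathway, and the numeric values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextOmicGraph:
    """Hypothetical container (not the OmniCellTOSG schema) that pairs
    text annotations with numeric values on the nodes of a signaling graph."""
    node_names: list[str]
    node_text: dict[str, str]        # node name -> human-written annotation
    node_values: dict[str, float]    # node name -> numeric omic value
    edges: list[tuple[str, str]] = field(default_factory=list)  # (source, target)

    def edge_index(self) -> list[list[int]]:
        """Return edges in COO layout: [source indices, target indices]."""
        index = {name: i for i, name in enumerate(self.node_names)}
        sources = [index[s] for s, _ in self.edges]
        targets = [index[t] for _, t in self.edges]
        return [sources, targets]

    def feature_vector(self) -> list[float]:
        """Return numeric node values in node order (input for a graph model)."""
        return [self.node_values[name] for name in self.node_names]

    def text_inputs(self) -> list[str]:
        """Return text annotations in node order (input for a language model)."""
        return [self.node_text[name] for name in self.node_names]


# Toy example: a simplified, illustrative pathway with made-up values.
graph = TextOmicGraph(
    node_names=["EGFR", "KRAS", "MAPK1"],
    node_text={
        "EGFR": "epidermal growth factor receptor",
        "KRAS": "KRAS proto-oncogene, GTPase",
        "MAPK1": "mitogen-activated protein kinase 1",
    },
    node_values={"EGFR": 2.5, "KRAS": 0.0, "MAPK1": 1.25},
    edges=[("EGFR", "KRAS"), ("KRAS", "MAPK1")],
)

print(graph.edge_index())
print(graph.feature_vector())
```

In a PyTorch workflow, the two index lists and the feature vector would typically be converted to tensors before being passed to a GNN, while the text annotations would go to a language model. How OmniCellTOSG itself performs these steps is not specified in this summary.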

Introduction
  Background
    Overview of LLMs (Large Language Models) and GNNs (Graph Neural Networks) in biological research
    Importance of integrating textual and graph-based data in understanding cell signaling pathways
  Objective
    To present OmniCellTOSG, a comprehensive dataset designed for joint modeling of language and graph data in cell text-omic signaling graphs
    To highlight its utility in advancing research on normal development and disease mechanisms
Dataset Overview
  Data Sources
    Description of the Gene Expression Omnibus (GEO) data used as the primary source
    Explanation of the exclusion criteria for duplicates
  Data Format
    Conversion of data into H5AD or H5 formats for compatibility with CellTypist
    Compatibility with PyTorch for advanced model development
Dataset Components
  Human Annotations
    Explanation of the role of human annotations in enriching the dataset
  Numeric Graph Values
    Description of the numeric values associated with graph nodes and edges
Applications
  Life Sciences
    Overview of how OmniCellTOSG benefits research in life sciences
  Healthcare
    Discussion of the potential impact on personalized medicine and disease diagnosis
  Related Fields
    Mention of applications in bioinformatics, computational biology, and systems biology
Dataset Management
  Large Genomic Data Handling
    Explanation of how OmniCellTOSG facilitates efficient management of large genomic datasets
  Data Accessibility
    Discussion of the availability and access mechanisms for the dataset
Conclusion
  Summary of Key Features
    Recap of the main features and benefits of OmniCellTOSG
  Future Directions
    Potential areas for further research and development with the dataset
Insights
What innovative features does the OmniCellTOSG dataset offer for managing large genomic data?
In what ways does the OmniCellTOSG dataset ensure compatibility with CellTypist?
What are the key steps in preparing the OmniCellTOSG dataset for use with PyTorch?
How does the OmniCellTOSG dataset integrate LLM and GNN modeling for cell text-omic signaling graphs?