LangCell: Language-Cell Pre-training for Cell Identity Understanding

Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie · May 09, 2024

Summary

LangCell is a pre-training framework for single-cell language models that improves cell identity understanding by incorporating cross-modal knowledge from text enriched with cell identity information. It targets settings where labeled data is limited, and is reported to outperform existing models in zero-shot, few-shot, and fine-tuning scenarios. The design comprises a cell-text dataset (scLibrary), a unified language-cell framework, and four pre-training tasks that improve single-cell representations and the model's ability to recognize links between cells and text. Key contributions include state-of-the-art performance in cell type annotation and batch integration, as well as new tasks such as cell-text retrieval and cancer subtype classification. By bridging scRNA-seq data and textual information, the model is particularly useful for biomedical research in data-scarce scenarios.
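To make the zero-shot setting concrete, the following is a minimal sketch of how annotation could work once cells and text share one embedding space: each candidate cell type is described in text, and the cell is assigned the label whose text embedding is closest. The function names, the toy three-dimensional embeddings, and the use of plain cosine similarity are all assumptions made for illustration; this is not the LangCell implementation or its API, and the actual scoring used by the authors may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_annotate(cell_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the cell.

    Hypothetical helper: label_embeddings maps a cell type name to the
    embedding of its textual description.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(cell_embedding, label_embeddings[name]),
    )


# Toy, hand-written embeddings (assumed values). In practice these would be
# produced by a cell encoder and a text encoder trained into a shared space.
label_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "B cell": [0.1, 0.9, 0.0],
    "monocyte": [0.0, 0.1, 0.9],
}
cell_embedding = [0.8, 0.2, 0.1]

predicted = zero_shot_annotate(cell_embedding, label_embeddings)
print(predicted)  # prints "T cell"
```

No labeled cells are needed at inference time in this sketch: adding a new cell type only requires embedding a new text description, which is what makes the approach attractive when annotations are scarce.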
