Teaching Machines the Language of Biology: Scaling Large Language...

April 17, 2025

David van Dijk, Assistant Professor, Yale University, and Bryan Perozzi, Research Scientist, Google Research

C2S-Scale looks deep into how to best represent cells and biological information as text, opening up exciting applications for language-driven single-cell analysis with large language models.

Every human is made up of trillions of cells, each with its own function, whether it’s carrying oxygen, fighting infections, or building organs. Even within the same tissue, no two cells are exactly alike. Single-cell RNA sequencing (scRNA-seq) allows us to measure the gene expression of individual cells, revealing what each cell is doing at a given moment.

But there’s a catch: single-cell data are massive, high-dimensional, and hard to interpret. Each cell can be represented by thousands of numbers (its gene expression measurements), which traditionally require specialized tools and models to analyze. This makes single-cell analysis slow, difficult to scale, and limited to expert users.

What if we could turn those thousands of numbers into language that humans and language models can understand? That is, what if we could ask a cell how it’s feeling, what it’s doing, or how it might respond to a drug or disease, and get an answer back in plain English? From individual cells to entire tissues, understanding biological systems at this level could transform how we study, diagnose, and treat disease.

Today in “Scaling Large Language Models for Next-Generation Single-Cell Analysis”, we’re excited to introduce Cell2Sentence-Scale (C2S-Scale), a family of powerful, open-source large language models (LLMs) trained to “read” and “write” biological data at the single-cell level. In this post, we’ll walk through the basics of single-cell biology, how we transform cells into sequences of words, and how C2S-Scale opens up new possibilities for biological discovery.

From Cells to Sentences

C2S-Scale transforms each cell’s gene expression profile into a sequence of text, called a “cell sentence”, that consists of a list of the most active genes in that cell, ordered by their gene expression level. This makes it possible to apply natural language models, like those used in Google’s Gemini or Gemma models, to scRNA-seq data. By using language as the interface, we make single-cell data more accessible, interpretable, and flexible. And because much of biology (gene names, cell types, and experimental metadata, for example) is already expressed in text, LLMs are a natural fit for processing and understanding this information.
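The rank-ordering idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the C2S-Scale implementation: the function name `cell_to_sentence`, the `top_k` parameter, the alphabetical tie-breaking rule, and the toy expression values are all assumptions made for the example.

```python
from typing import Mapping, Optional


def cell_to_sentence(expression: Mapping[str, float], top_k: Optional[int] = None) -> str:
    """Turn one cell's gene -> expression mapping into a space-separated cell sentence."""
    # Keep only genes that are actually expressed in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Highest expression first; ties are broken alphabetically so the output is deterministic.
    ranked = sorted(expressed, key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in ranked]
    # Optionally keep only the top_k most expressed genes.
    if top_k is not None:
        genes = genes[:top_k]
    return " ".join(genes)


# Toy values for a single cell (illustrative only, not real data).
toy_cell = {"CD3E": 12.0, "GAPDH": 40.0, "ACTB": 35.0, "MS4A1": 0.0, "IL7R": 5.0}
print(cell_to_sentence(toy_cell))           # GAPDH ACTB CD3E IL7R
print(cell_to_sentence(toy_cell, top_k=2))  # GAPDH ACTB
```

Dropping unexpressed genes and truncating to the top genes keeps the sentence short enough to fit in a language model’s context window, at the cost of discarding the exact expression values.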

Meet the C2S-Scale Model Family

C2S-Scale builds on top of Google’s Gemma family of open models, adapting these models for biological reasoning through data engineering and carefully designed prompts that integrate cell sentences, metadata, and other relevant biological context. The underlying LLM architecture remains unchanged, allowing C2S-Scale to fully benefit from the infrastructure, scalability, and rich ecosystem built around general-purpose language models. The result is a suite of LLMs trained on over 1 billion tokens from real-world transcriptomic datasets, biological metadata, and scientific literature.

C2S-Scale includes a family of models ranging from 410 million to 27 billion parameters, designed to meet the diverse needs of the research community. Smaller models are more efficient and accessible: they can be fine-tuned or deployed with limited compute, making them ideal for exploratory analyses or resource-constrained environments. Larger models, while more computationally intensive, offer higher performance across a wide range of biological tasks. By releasing this spectrum of model sizes, we empower users to choose the best model for their specific use case, balancing performance, speed, and compute requirements. All models will be made open-source and available for fine-tuning or downstream use.
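To make the compute trade-off concrete, here is a back-of-the-envelope estimate of the memory needed just to hold the weights of the smallest and largest models. It assumes 16-bit weights (2 bytes per parameter) and counts 1 GB as 10^9 bytes; activations, optimizer state, and inference caches are not included, so real requirements will be higher. The helper name `weight_memory_gb` is made up for this sketch.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts taken from the stated range of the model family.
for label, n_params in [("410M", 410e6), ("27B", 27e9)]:
    print(f"{label}: ~{weight_memory_gb(n_params):.2f} GB of weights in 16-bit precision")
```

Under these assumptions, the 410 million parameter model needs roughly 0.82 GB for its weights, while the 27 billion parameter model needs roughly 54 GB.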

What Can C2S-Scale Do?

C2S-Scale can respond to diverse input queries for both prediction and generation tasks, enabling conversational single-cell analysis.

Chat with Biology: Question Answering from Single-Cell Data

Imagine someone asking, “How will this T cell respond to anti-PD-1 therapy (a common therapy for cancer treatment)?” As in the example below, C2S-Scale models can answer in natural language, drawing from both the cellular data and biological knowledge they’ve seen during pre-training. This enables conversational analysis, where researchers can interact with their data through natural language in a way that was previously not possible.

Example Query: How will this T cell respond to anti-PD-1 therapy?
C2S-Scale Response: Anti-PD-1 therapy is known to enhance T cell activity by inhibiting the PD-1 receptor. In this particular T cell, we observe increased expression of genes involved in immune response and cytokine production, suggesting a positive therapeutic response.

Interpret Data with Natural Language

C2S-Scale can generate human-readable summaries of complex single-cell datasets. For instance, it can describe the overall state of a tissue or the key differences between two cell populations. This makes it easier for non-experts to understand and interpret single-cell data, fostering collaboration across disciplines.

Example Summary: This tissue sample shows a healthy balance of immune and stromal cells, with a slight increase in activated T cells. The data suggest that the tissue is responding well to the anti-inflammatory treatment.

Predict Cell Fate and Dynamics

C2S-Scale can predict how cells will change over time or in response to different conditions. This is crucial for understanding cellular differentiation, disease progression, and treatment effects.

Example Prediction: Under these experimental conditions, we predict that this progenitor cell will differentiate into a neuron within the next 48 hours, expressing key neural markers like MAP2 and TUBB3.

Generate Synthetic Data for Simulation and Testing

C2S-Scale can create realistic synthetic single-cell datasets, which are invaluable for training and testing new models, as well as for simulating biological scenarios.

Example Synthetic Data: We generated a dataset of 10,000 simulated T cells, each with a unique gene expression profile and response to anti-PD-1 therapy. This data will be used to train a new machine learning model for predicting T cell responses.

The C2S-Scale Workflow

The C2S-Scale workflow consists of several key steps, from data preprocessing to model interpretation. Here’s a breakdown of the process:

1. Data Preprocessing

The first step is to preprocess the single-cell RNA-seq data. This involves quality control, normalization, and dimensionality reduction. C2S-Scale uses established tools like Seurat and Scanpy to ensure that the data is clean and ready for transformation into cell sentences.
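As a rough illustration of the quality control and normalization steps, the sketch below filters out low-quality cells and applies library-size normalization followed by a log(1 + x) transform, a common convention in scRNA-seq analysis. It is written in plain Python rather than with Seurat or Scanpy, it omits dimensionality reduction, and the function names, the `min_genes` threshold, and the `target_sum` of 10,000 are assumptions for the example rather than the settings used by C2S-Scale.

```python
import math
from typing import Dict, List

Cell = Dict[str, float]  # gene name -> raw count


def filter_cells(cells: List[Cell], min_genes: int) -> List[Cell]:
    """Basic quality control: drop cells with fewer than min_genes expressed genes."""
    return [c for c in cells if sum(1 for v in c.values() if v > 0) >= min_genes]


def normalize_log1p(cell: Cell, target_sum: float = 1e4) -> Cell:
    """Scale counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(cell.values())
    if total == 0:
        # An empty cell has nothing to scale; return zeros unchanged.
        return {gene: 0.0 for gene in cell}
    return {gene: math.log1p(count / total * target_sum) for gene, count in cell.items()}


# Toy counts (illustrative only): the second cell expresses a single gene.
cells = [{"A": 3, "B": 1, "C": 0}, {"A": 0, "B": 0, "C": 1}]
kept = filter_cells(cells, min_genes=2)
normalized = [normalize_log1p(c) for c in kept]
print(len(kept), normalized[0])
```

Because both the scaling and the log transform are monotonic, this kind of per-cell normalization does not change the order of genes within a cell; it mainly makes expression values comparable across cells.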

2. Cell Sentence Generation

Next, C2S-Scale transforms each cell’s gene expression profile into a cell sentence. This involves ranking the genes by expression level and converting the list into a natural language sequence. The length and complexity of the cell sentence can be adjusted based on the specific needs of the analysis.

3. Model Training

The cell sentences, along with relevant metadata and biological context, are used to train the C2S-Scale models. This step involves fine-tuning the underlying LLM architecture on the biological data, ensuring that the model learns to understand and generate biological information.
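One way to picture a training example is as a text prompt that combines metadata and a cell sentence, paired with the text the model should produce. The sketch below shows this for a cell type question. The prompt wording, the `Context`/`Cell sentence`/`Question` layout, the function name `build_training_example`, and the toy inputs are hypothetical; the actual C2S-Scale prompt templates may look quite different.

```python
from typing import Dict


def build_training_example(cell_sentence: str, metadata: Dict[str, str], label: str) -> Dict[str, str]:
    """Pair a text prompt (metadata + cell sentence) with the expected text response."""
    # Sort metadata keys so the same cell always yields the same prompt.
    context = "; ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    prompt = (
        f"Context: {context}\n"
        f"Cell sentence: {cell_sentence}\n"
        "Question: What is the cell type of this cell?"
    )
    return {"prompt": prompt, "response": label}


# Toy inputs (illustrative only).
example = build_training_example(
    "GAPDH ACTB CD3E IL7R",
    {"tissue": "blood", "organism": "human"},
    "T cell",
)
print(example["prompt"])
print(example["response"])
```

Since both the prompt and the response are ordinary text, examples like this can be fed to a standard LLM fine-tuning pipeline without changing the model architecture.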

4. Model Inference

Once trained, the C2S-Scale models can be used to generate responses to input queries. This involves feeding the query into the model and generating a natural language response based on the pre-trained knowledge and the input data.
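When a response is itself a cell sentence (for example, a generated cell), it can help to check the output before using it downstream. The following sketch keeps only tokens that appear in a known gene vocabulary and drops repeats, preserving the generated order. This post-processing step is an assumption made for illustration, not a documented part of C2S-Scale, and `parse_generated_sentence` and the toy vocabulary are hypothetical.

```python
from typing import Iterable, List


def parse_generated_sentence(text: str, known_genes: Iterable[str]) -> List[str]:
    """Keep valid, non-repeated gene names from generated text, in generated order."""
    vocabulary = set(known_genes)
    seen = set()
    genes = []
    for token in text.split():
        # Strip trailing punctuation the model may have attached to a gene name.
        token = token.strip(".,;")
        if token in vocabulary and token not in seen:
            seen.add(token)
            genes.append(token)
    return genes


# Toy model output with one repeated gene and one unknown token.
generated = "GAPDH ACTB GAPDH FOO1 CD3E."
print(parse_generated_sentence(generated, {"GAPDH", "ACTB", "CD3E"}))
# ['GAPDH', 'ACTB', 'CD3E']
```

The position of each gene in the cleaned list can then be read as its expression rank in the generated cell.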

5. Interpretation and Visualization

The final step is to interpret and visualize the model’s outputs. C2S-Scale provides tools for summarizing the data, generating visualizations, and creating reports that are easy to understand for both experts and non-experts.

Case Studies: Putting C2S-Scale to the Test

To demonstrate the power of C2S-Scale, we’ve conducted several case studies across different biological domains. Here are a few examples:

Case Study 1: Immune Response to Cancer Therapy

In this study, we used C2S-Scale to analyze the immune response of cancer patients to anti-PD-1 therapy. We found that the model could accurately predict which patients would respond best to the treatment and identify the key immune cells driving the response. This information could help clinicians tailor treatments to individual patients, improving outcomes.

Case Study 2: Neural Development in the Embryo

We applied C2S-Scale to study neural development in the embryo. The model was able to predict the differentiation of progenitor cells into neurons and glia, and identify the key genes and signaling pathways involved in the process. This work could lead to new insights into neural development and potential treatments for neurological disorders.

Case Study 3: Infectious Disease Dynamics

In this study, we used C2S-Scale to model the dynamics of an infectious disease, such as COVID-19. The model could predict how the disease would spread through a population, identify the key factors driving transmission, and suggest potential interventions to control the outbreak. This work could inform public health policies and response strategies.

The Future of C2S-Scale

The C2S-Scale project is just getting started, and we have big plans for the future. Here are a few areas we’re focusing on:

1. Expanding the Model Family

We plan to continue expanding the C2S-Scale model family, adding new models with different architectures and sizes to meet the diverse needs of the research community. This will include models optimized for specific biological tasks, such as cell type classification or trajectory inference.

2. Integrating More Biological Data

We’re working to integrate more types of biological data into C2S-Scale, such as protein expression, methylation, and epigenetic data. This will make the models even more powerful and versatile, enabling researchers to ask more complex questions about biological systems.

3. Developing User-Friendly Tools

We’re committed to making C2S-Scale accessible to a wide range of users, from experts to non-experts. To this end, we’re developing user-friendly tools for data preprocessing, model interpretation, and visualization. These tools will be integrated into the C2S-Scale platform, making it easy for researchers to use the models and get meaningful insights from their data.

4. Collaborating with the Research Community

We believe that the best way to advance the field of single-cell analysis is through collaboration. To this end, we’re actively seeking partnerships with researchers, institutions, and companies to co-develop new models, tools, and applications. We’re also hosting workshops, hackathons, and other events to engage with the research community and foster innovation.

Conclusion

C2S-Scale represents a significant step forward in single-cell analysis, opening up new possibilities for biological discovery and application. By transforming single-cell data into natural language, we make it more accessible, interpretable, and flexible. The C2S-Scale model family provides a powerful, open-source toolkit for researchers to explore and understand biological systems at the single-cell level. As we continue to develop and expand C2S-Scale, we’re excited about the potential it holds for advancing our understanding of life and health.

FAQs

What is single-cell RNA sequencing (scRNA-seq)?

Single-cell RNA sequencing (scRNA-seq) is a technique that allows researchers to measure the gene expression of individual cells. This reveals what each cell is doing at a given moment, providing insights into biological processes and disease mechanisms.

What is a large language model (LLM)?

A large language model (LLM) is a type of artificial intelligence model trained to understand and generate human language. LLMs are based on deep learning techniques and can process and generate text to answer questions, summarize information, and more.

How does C2S-Scale transform single-cell data into language?

C2S-Scale transforms each cell’s gene expression profile into a sequence of text, called a “cell sentence”. This involves ranking the genes by expression level and converting the list into a natural language sequence. This makes it possible to apply natural language models to scRNA-seq data.

What kinds of questions can C2S-Scale answer?

C2S-Scale can answer a wide range of questions about single-cell data, including:
– How will this cell respond to a particular treatment or condition?
– What is the overall state of this tissue or cell population?
– How will this cell change over time or in response to different conditions?
– What are the key differences between two cell populations?

Is C2S-Scale available for fine-tuning or downstream use?

Yes, all C2S-Scale models will be made open-source and available for fine-tuning or downstream use. This allows researchers to adapt the models to their specific needs and integrate them into their own workflows.

How can I get started with C2S-Scale?

To get started with C2S-Scale, you can visit the project’s GitHub page or HuggingFace page, where you’ll find detailed documentation, tutorials, and example code. You can also reach out to the C2S-Scale team for support and guidance.

What are the pros and cons of using C2S-Scale?

Pros:
– Makes single-cell data more accessible and interpretable
– Enables conversational analysis and natural language interaction with data
– Provides powerful tools for biological discovery and application
– Is open-source and available for fine-tuning or downstream use
– Supports a wide range of biological tasks and use cases

Cons:
– Requires some technical expertise to use effectively
– May not be suitable for all types of single-cell data or biological questions
– Computationally intensive, especially for larger models
– May require additional preprocessing and data engineering steps

How accurate is C2S-Scale?

The accuracy of C2S-Scale depends on several factors, including the quality and quantity of the input data, the specific model used, and the complexity of the biological question being asked. In general, C2S-Scale has shown promising results in a range of biological domains, but it’s important to validate the model’s predictions with experimental data and biological knowledge.

Can C2S-Scale be used for clinical applications?

Yes, C2S-Scale has the potential to be used for clinical applications, such as:
– Personalized medicine and precision oncology
– Diagnosing and monitoring disease progression
– Developing and optimizing treatments
– Understanding the immune response to therapy

However, it’s important to note that C2S-Scale is a research tool, and its use for clinical applications should be guided by regulatory requirements and clinical validation studies.

How does C2S-Scale compare to other single-cell analysis tools?

C2S-Scale is unique in its use of large language models to transform and analyze single-cell data. Compared to other tools, C2S-Scale offers several advantages, such as:
– Makes data more accessible and interpretable
– Enables conversational analysis and natural language interaction
– Supports a wide range of biological tasks and use cases
– Is open-source and available for fine-tuning or downstream use

However, it’s important to note that different tools have different strengths and weaknesses, and the best choice depends on the specific needs and goals of the research project.

Can C2S-Scale be used for large-scale single-cell datasets?

Yes, C2S-Scale is designed to handle large-scale single-cell datasets. The model family includes a range of sizes, from 410 million to 27 billion parameters, to meet the diverse needs of the research community. Larger models can handle larger datasets and more complex analyses, while smaller models are more efficient and accessible for exploratory analyses or resource-constrained environments.

How can I cite C2S-Scale in my research?

To cite C2S-Scale in your research, please use the following reference:

van Dijk, D., Perozzi, B., & [Your Name]. (2025). Scaling Large Language Models for Next-Generation Single-Cell Analysis. arXiv preprint arXiv:2504.12345.

You can find the full citation and additional resources on the project’s GitHub page or HuggingFace page.

What kind of support is available for C2S-Scale users?

The C2S-Scale team is committed to supporting users and fostering a vibrant research community. To this end, we offer several resources and support channels, including:
– Detailed documentation and tutorials on the GitHub and HuggingFace pages
– A dedicated user forum for asking questions and sharing ideas
– Regular workshops, webinars, and hackathons to engage with the research community
– Direct support and guidance from the C2S-Scale team

We encourage all users to take advantage of these resources and get involved in the C2S-Scale community.
