
January 27, 2026
In the rapidly evolving world of artificial intelligence, language models are at the forefront of innovation. However, a significant gap exists in the public research on scaling laws for multilingual models. Over 50% of AI model users speak non-English languages, yet publicly accessible scaling laws are predominantly focused on English. This imbalance leaves model builders struggling to make data-driven decisions about efficiency, quality, and cost when developing models for non-English languages or specific language mixtures. Shayne Longpre, a Google Cloud Student Researcher, and Sayna Ebrahimi, a Research Scientist at Google DeepMind, have introduced a groundbreaking solution in their upcoming paper, “ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality,” to be presented at ICLR 2026. This paper addresses the critical gap in public research by presenting the largest public multilingual pre-training study to date, spanning 774 training runs across 10M–8B parameter models and covering 400+ languages.
The Need for ATLAS: Bridging the Gap in Multilingual AI Research
The need for ATLAS follows directly from this imbalance: the majority of AI model users speak non-English languages, while publicly accessible scaling laws remain overwhelmingly English-centric. Model builders tasked with serving billions of international and multilingual users are therefore left without data-driven guidance for key development decisions about efficiency, quality, and cost when building for non-English languages or with specific language mixtures. In “ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality,” Shayne Longpre and Sayna Ebrahimi address this gap by introducing adaptive transfer scaling laws (ATLAS) for building multilingual models.
Understanding ATLAS: A Single Scaling Law for Multilingual Mixtures
ATLAS is a simple, practical approach to determining optimal model size, data volume, and language mixtures for training. Unlike traditional scaling laws that focus on monolingual settings, ATLAS provides these recommendations for more complex, multilingual environments. It specifically optimizes performance on a target language (e.g., Catalan) by leveraging data from multiple different languages. ATLAS extends these traditional scaling law principles through three components:
Cross-Lingual Transfer Matrix: Identifying Synergistic Language Pairs
The cross-lingual transfer matrix is a crucial component of ATLAS. It identifies which languages are best to train together, enabling practitioners to efficiently balance the mix of languages in training data with model size. For instance, a training mix for Catalan might include Romance languages like Spanish, Portuguese, and Italian. This matrix captures the synergies between languages, a capability prior scaling laws did not support.
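As a sketch of how such a matrix might be consumed in practice, the snippet below ranks candidate helper languages for a target. The scores, language codes, and `helper_languages` function are illustrative placeholders, not values or APIs from the paper.

```python
# Toy cross-lingual transfer matrix: TRANSFER[target][source] is an
# illustrative score for how much training on `source` helps `target`.
TRANSFER = {
    "ca": {"es": 0.82, "pt": 0.74, "it": 0.69, "en": 0.41, "zh": -0.05},
    "no": {"sv": 0.88, "de": 0.61, "en": 0.45, "hi": 0.02},
}

def helper_languages(target, k=3, threshold=0.0):
    """Return up to k source languages with the highest positive transfer."""
    ranked = sorted(TRANSFER[target].items(), key=lambda kv: kv[1], reverse=True)
    return [lang for lang, score in ranked[:k] if score > threshold]

print(helper_languages("ca"))  # ['es', 'pt', 'it']
```

Filtering by a positive threshold matters as much as ranking: a language with negative transfer (like the placeholder "zh" entry above) would actively interfere and should be excluded from the mix.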
Scaling Law for Multilingual Models: Efficiently Expanding Model Size and Data
The scaling law provides guidance on efficiently expanding model size and data as the number of supported languages increases. This is particularly important in multilingual environments where the model needs to handle a diverse set of languages. The law helps in determining the optimal scaling trajectories for model size (N) and data size (D), ensuring that the model is neither overburdened nor underutilized.
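The paper's fitted functional form is not reproduced here, but a classic Chinchilla-style loss surface illustrates how such a law turns a compute budget into an (N, D) recommendation. Every coefficient below is a made-up placeholder, and the C ≈ 6ND FLOPs approximation is the standard one, not necessarily the paper's.

```python
# Illustrative Chinchilla-style loss: L(N, D) = E + A/N^alpha + B/D^beta.
# All coefficients are placeholders, not values fitted by ATLAS.
E, A, B, ALPHA, BETA = 1.7, 400.0, 1100.0, 0.34, 0.29

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_allocation(compute, steps=4000):
    """Grid-search the model size N (with D = C / 6N tokens) that minimises
    predicted loss under the usual C ~= 6*N*D FLOPs approximation."""
    best = None
    for i in range(1, steps + 1):
        n = 10 ** (6 + 7 * i / steps)          # sweep N from 1e6 to 1e13
        d = compute / (6 * n)
        if d < 1:
            continue
        cand = (loss(n, d), n, d)
        if best is None or cand[0] < best[0]:
            best = cand
    return best
```

Sweeping `compute` across several orders of magnitude traces out an optimal (N, D) trajectory of the kind discussed below; with more compute, both the optimal model size and the optimal token count grow.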
Rules for Pre-Training vs. Fine-Tuning: Balancing Efficiency and Performance
ATLAS includes rules for deciding when to pre-train a model from scratch versus fine-tuning from a multilingual checkpoint. This decision is crucial for balancing efficiency and performance. Pre-training from scratch can provide a more generalized model, while fine-tuning can leverage existing knowledge, saving compute resources.
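The actual decision rule is part of the paper's contribution; as a stand-in, here is a toy version in which a multilingual checkpoint grants an effective-data "credit" but carries a small constant interference penalty. The functional form, the `credit` and `interference` parameters, and every number are assumptions for illustration only.

```python
def loss_from_scratch(target_tokens):
    # Toy power law in target-language data; coefficients are placeholders.
    return 2.0 + 900.0 / target_tokens ** 0.3

def loss_from_checkpoint(target_tokens, credit=5e9, interference=0.02):
    # The checkpoint contributes `credit` effective target-relevant tokens,
    # but keeps a small constant penalty from its multilingual setup.
    return 2.0 + 900.0 / (target_tokens + credit) ** 0.3 + interference

def should_finetune(target_tokens):
    """Fine-tune when the checkpoint's data credit outweighs its penalty."""
    return loss_from_checkpoint(target_tokens) < loss_from_scratch(target_tokens)
```

With these placeholder numbers, the rule recommends fine-tuning for data-scarce targets and from-scratch pre-training once target-language data is abundant, matching the qualitative trade-off described above.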
The MADLAD-400 Corpus: A Comprehensive Dataset for Multilingual Training
The novel approach of ATLAS is enabled by the MADLAD-400 corpus, a dataset covering 400+ languages, over which the researchers conducted more than 750 training runs. For each target, ATLAS accounts for three distinct data sources:
1. Target Language: The primary language for which the model is being optimized.
2. Similar Transfer Languages: Languages that are empirically shown to be beneficial for the target language. For example, Catalan might include Spanish, Portuguese, and Italian.
3. All Other Languages: The remaining languages in the dataset, which provide additional context and diversity.
This comprehensive dataset allows ATLAS to learn how much each source actually helps or hinders the target language, a capability prior laws did not support.
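A minimal sketch of turning those three buckets into sampling weights follows; the 50/35/15 split and the `build_mixture` helper are arbitrary placeholders, not ratios or code from the paper.

```python
def build_mixture(target, helpers, others,
                  w_target=0.50, w_helpers=0.35, w_others=0.15):
    """Assign sampling weights to the three data sources, splitting each
    bucket's share uniformly across its languages. Weights sum to 1."""
    mix = {target: w_target}
    for lang in helpers:
        mix[lang] = w_helpers / len(helpers)
    for lang in others:
        mix[lang] = w_others / len(others)
    return mix

mix = build_mixture("ca", helpers=["es", "pt", "it"], others=["de", "ru"])
```

In practice the bucket weights are exactly what a fitted scaling law would tune; the point of the sketch is only the bookkeeping of target, transfer, and other-language shares.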
Evaluation of ATLAS: Consistently Outperforming Prior Work
To evaluate how well ATLAS predicts a model’s performance on new model sizes, varying amounts of training data, or new language mixtures, the researchers used the MADLAD-400 dataset. They measured performance using a vocabulary-insensitive loss across over 750 independent runs in monolingual, bilingual, and massively multilingual settings. The evaluations showed that ATLAS consistently outperforms prior work.
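The post does not spell out the metric, but a standard vocabulary-insensitive choice is to normalise the summed log-likelihood by UTF-8 bytes rather than by tokens, so models with different tokenizers and vocabularies remain comparable. A sketch of that normalisation, which may differ from the paper's exact definition:

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed token negative log-likelihood (in nats) into
    bits per UTF-8 byte, removing the tokenizer's vocabulary from the unit."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)
```

Byte-level normalisation matters especially across scripts: a Devanagari or CJK character occupies several UTF-8 bytes, so per-token losses from different vocabularies are not directly comparable, while bits-per-byte is.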
Optimal Scaling Trajectories for Six Languages
For six languages—English (EN), French (FR), Russian (RU), Chinese (ZH), Hindi (HI), and Swahili (SW)—the researchers analyzed how ATLAS predicted the optimal model size (N) and data size (D) should be scaled. The optimal scaling trajectories for these languages revealed two key observations:
1. Similar Curves with Compute-Efficiency Tax: The curves for optimal scaling trajectories look strikingly similar across languages, but training with a multilingual vocabulary or fully multilingual data comes with a compute-efficiency tax—especially for English.
2. Upward Bends for Low-Resource Languages: Low-resource languages show upward bends as they run out of data, and the model struggles to learn from data repetition. ATLAS explicitly models these effects.
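The post does not give ATLAS's functional form for repetition, but the upward bends can be mimicked with a saturating "effective data" term, where each extra epoch over a finite corpus contributes exponentially less. The shape and the `r_star` constant below are illustrative assumptions, not the paper's fit.

```python
import math

def effective_data(unique_tokens, epochs, r_star=15.0):
    """Effective token count after `epochs` passes over `unique_tokens`.
    Equals unique_tokens at one epoch, then saturates as extra
    repetitions stop adding new information."""
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))
```

Plugging `effective_data` in place of raw token counts in a data-scaling term reproduces the qualitative bend: once a low-resource language exhausts its unique text, predicted loss stops improving at the rate the naive law would suggest.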
Visualizing Optimal Scaling Trajectories
The charts below show the optimal scaling trajectories determined by ATLAS for each language and model type. The lines represent three configurations:
– Solid (monolingual vocab/data): Monolingual vocabulary and data.
– Dashed (multilingual vocab/monolingual data): Multilingual vocabulary with monolingual data.
– Dotted (multilingual vocab/multilingual data): Multilingual vocabulary with multilingual data.
The dotted lines are consistently highest, indicating that training with a full multilingual setup requires slightly more compute for the same quality.
Cross-Lingual Transfer Map: Quantifying Language Synergies
Next, the researchers measured language-to-language synergies and interference at scale, producing a matrix that quantifies how much training on language A helps (or hurts) language B. The findings are intuitive:
– Norwegian: Helped primarily by Swedish and German.
– Malay: Helped by Indonesian.
– Arabic: Helped by Hebrew.
– English, French, and Spanish: The most widely helpful languages, likely due to the inherent quality, heterogeneity, and quantity of text in these languages found on the web.
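One way to read such a map: each cell can be derived from paired training runs, comparing loss on the target alone versus with a candidate source language added. The snippet below uses fabricated loss values purely to show the bookkeeping, not measurements from the paper.

```python
# Fabricated losses for illustration; "no" = Norwegian, "ms" = Malay.
LOSS_TARGET_ONLY = {"no": 3.10, "ms": 3.40}
LOSS_WITH_SOURCE = {
    ("sv", "no"): 2.85,   # Swedish added to the Norwegian mix
    ("hi", "no"): 3.12,   # Hindi added to the Norwegian mix
    ("id", "ms"): 3.05,   # Indonesian added to the Malay mix
}

def transfer_score(source, target):
    """Positive = source helps target; negative = interference."""
    return LOSS_TARGET_ONLY[target] - LOSS_WITH_SOURCE[(source, target)]
```

Repeating this comparison over every (source, target) pair fills in the full matrix, which is what makes the synergies (Swedish helping Norwegian) and interference cases directly quantifiable rather than anecdotal.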
The Impact of ATLAS: Practical Guidance for Multilingual Model Builders
The impact of ATLAS is significant for multilingual model builders. It provides practical guidance on how to mix data and train the most effective models to serve languages beyond English. By leveraging ATLAS, practitioners can efficiently balance the mix of languages in training data with model size, ensuring that their models are both efficient and high-performing.
Pros of ATLAS
– Data-Driven Decisions: ATLAS provides data-driven guidance for key development decisions about efficiency, quality, and cost.
– Efficient Training: The cross-lingual transfer matrix and scaling law help in efficiently expanding model size and data.
– Balanced Performance: The rules for pre-training vs. fine-tuning help in balancing efficiency and performance.
– Comprehensive Dataset: The MADLAD-400 corpus provides a comprehensive dataset for multilingual training.
Cons of ATLAS
– Compute-Efficiency Tax: Training with a multilingual vocabulary or fully multilingual data comes with a compute-efficiency tax.
– Complexity: The approach is more complex than traditional scaling laws, requiring a deeper understanding of multilingual training.
Conclusion: The Future of Multilingual AI Models
ATLAS represents a significant step forward in the field of multilingual AI models. By introducing adaptive transfer scaling laws, it provides practical guidance for building efficient and high-performing models that serve languages beyond English. As AI continues to evolve, ATLAS will undoubtedly play a crucial role in shaping the future of multilingual AI.
FAQ: Common Questions About ATLAS
What is ATLAS, and why is it important?
ATLAS is a set of adaptive transfer scaling laws for building multilingual models. It is important because it provides practical guidance for mixing data and training the most effective models to serve languages beyond English. This is crucial for model builders serving billions of international and multilingual users.
How does ATLAS differ from traditional scaling laws?
ATLAS differs from traditional scaling laws in that it provides recommendations for more complex, multilingual environments. It optimizes performance on a target language by leveraging data from multiple different languages. Traditional scaling laws focus on monolingual settings.
What is the MADLAD-400 corpus, and why is it important for ATLAS?
The MADLAD-400 corpus is a comprehensive dataset covering 400+ languages, over which the researchers conducted more than 750 training runs. It is important for ATLAS because it enables the law to learn how much each data source actually helps or hinders the target language, a capability prior laws did not support.
How does ATLAS help in deciding when to pre-train vs. fine-tune a model?
ATLAS includes rules for deciding when to pre-train a model from scratch versus fine-tuning from a multilingual checkpoint. This decision is crucial for balancing efficiency and performance. Pre-training from scratch can provide a more generalized model, while fine-tuning can leverage existing knowledge, saving compute resources.
What are the pros and cons of using ATLAS for multilingual model building?
The pros of using ATLAS include data-driven decisions, efficient training, balanced performance, and a comprehensive dataset. The cons include a compute-efficiency tax and increased complexity.