
Synthetic and Federated: Revolutionizing Privacy-Preserving Domain…


July 24, 2025

In the rapidly evolving landscape of artificial intelligence, the success of machine learning models hinges on both the scale and quality of data. The paradigm of pre-training on massive web data and post-training on smaller, high-quality data has proven instrumental in enhancing both large and small language models (LMs). For instance, post-training of small models has led to significant improvements, such as a 3%–13% boost in key production metrics for mobile typing applications like Gboard. However, this approach raises critical privacy concerns, particularly the risk of memorizing sensitive user instruction data. Enter privacy-preserving synthetic data, a game-changer in federated learning that minimizes these risks while improving model performance.

At AI News Daily, we’re thrilled to share our exploration over the past few years of generating and using synthetic data to enhance LMs for mobile typing applications. This blog post delves into these approaches, which adhere to the privacy principles of data minimization and data anonymization, and showcases their real-world impact on both small and large models in Gboard. We’ll also address common user questions and highlight the latest research advances.

Understanding Privacy-Preserving Synthetic Data

Privacy-preserving synthetic data is a powerful tool in federated learning, allowing us to access user interaction data to improve models while systematically minimizing privacy risks. With the generation capabilities of large language models (LLMs), we can create synthetic data that mimics user data without the risk of memorization. This synthetic data can then be used in model training just as public data is used, simplifying privacy-preserving model training.
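The idea that synthetic data can be consumed exactly like public data is easy to sketch. The corpus contents, mixing fraction, and function name below are illustrative, not details of the production pipeline:

```python
import random

def build_pretraining_mixture(public_corpus, synthetic_corpus,
                              synthetic_fraction=0.3, n_samples=10, seed=0):
    """Sample a pre-training stream mixing public and synthetic text.

    Because the synthetic corpus contains no real user data, it can be
    sampled with exactly the same machinery as the public corpus.
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(n_samples):
        source = synthetic_corpus if rng.random() < synthetic_fraction else public_corpus
        stream.append(rng.choice(source))
    return stream

public = ["a sentence scraped from the web", "another public sentence"]
synthetic = ["llm-generated, typing-style sentence"]
stream = build_pretraining_mixture(public, synthetic)
```

The key point is that no branch of this pipeline ever touches private user data; privacy is handled upstream, when the synthetic corpus is generated.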

How Gboard Leverages Synthetic Data

Gboard, Google’s keyboard app, uses both small LMs and LLMs to enhance the typing experience for billions of users. Small LMs support core features like slide to type, next word prediction (NWP), smart compose, and smart completion and suggestion. LLMs, on the other hand, power advanced features like proofreading. Our recent paper, “Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications,” discusses the advances in privacy-preserving synthetic data for LLMs in production.

Learning from Public and Private Data

Our 2024 blog post highlighted best practices for privacy-preserving training on user data to adapt small LMs to the domain of mobile typing text. Federated learning (FL) with differential privacy (DP) ensures that user data stays on users’ own devices, has minimal exposure during training, and is not memorized by the trained models. Pre-training on web data improves the performance of private post-training, empowering the deployment of user-level DP in production.

Federated Learning with Differential Privacy

In today’s context, user data generated in applications is considered private data, while accessible web data and models trained on them are public information. Gboard employs a privacy defense-in-depth strategy to mitigate concerns about potential information leakage in public data. All Gboard production LMs trained on user data use FL with DP guarantees, including key decoder models and the 2024 NWP models. This milestone was achieved by launching dozens of new LMs trained with federated learning and differential privacy (DP-FL LMs) and replacing all older FL-only models.
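The post doesn’t include code, but the core of one DP federated averaging round can be sketched as follows: clip each client’s model update to a fixed L2 norm, sum the clipped updates, and add Gaussian noise calibrated to the clipping bound before averaging. This is a minimal generic sketch; the function name and all parameter values are illustrative, not Gboard’s actual configuration:

```python
import numpy as np

def dp_fedavg_round(client_updates, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """One round of federated averaging with user-level DP.

    Each client's update is clipped to L2 norm `clip_norm`, the clipped
    updates are summed, and Gaussian noise calibrated to the clipping
    bound is added before averaging.
    """
    rng = np.random.default_rng(seed)
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        # Scale down any update whose norm exceeds the clipping bound.
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    total += rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return total / len(client_updates)

updates = [np.ones(4), 3.0 * np.ones(4)]
avg_update = dp_fedavg_round(updates, clip_norm=1.0, noise_multiplier=0.5)
```

Clipping bounds each user’s influence on the aggregate, which is what lets the added noise translate into a user-level DP guarantee.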

Recent Research Advances

Since 2024, research has continued to advance rapidly. We’ve introduced a new DP algorithm, BLT-DP-FTRL, which offers strong privacy-utility trade-offs and ease of use in deployment. We’ve also adopted the SI-CIFG model architecture for efficient on-device training and compatibility with DP. Furthermore, we use synthetic data from LLMs to improve pre-training, demonstrating our dedication to privacy-preserving learning for both small and large LMs.
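BLT-DP-FTRL’s internals aren’t covered here, but the tree-aggregation idea underlying DP-FTRL-style algorithms can be sketched: the cumulative sum of per-round updates through step t is covered by at most one dyadic interval per set bit of t, and each interval’s noise is sampled once and reused across prefixes, so every prefix sum carries only O(log T) noise terms instead of t. This is a generic illustration of tree aggregation, not the BLT mechanism itself:

```python
import numpy as np

def tree_noise(t, dim, sigma, node_noise, rng=None):
    """Noise carried by the cumulative sum of rounds 1..t.

    [1, t] is decomposed into aligned dyadic intervals (one per set bit
    in t); a node's noise is sampled once and cached in `node_noise`.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noise = np.zeros(dim)
    lo, remaining = 1, t
    while remaining > 0:
        k = remaining.bit_length() - 1      # largest power of two <= remaining
        node = (lo, lo + 2**k - 1)          # dyadic interval covered by this node
        if node not in node_noise:
            node_noise[node] = rng.normal(0.0, sigma, dim)
        noise += node_noise[node]
        lo += 2**k
        remaining -= 2**k
    return noise

nodes = {}
cum_noise = tree_noise(5, dim=2, sigma=1.0, node_noise=nodes)  # uses [1,4] and [5,5]
```

Reusing node noise across prefixes is what gives DP-FTRL-style training its favorable privacy-utility trade-off over adding fresh noise every round.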

Synthetic Data via Public LLMs

In our paper, “Prompt Public Large Language Models to Synthesize Data for Private On-device Applications,” we describe our use of synthetic data for pre-training small LMs that are later post-trained with DP and FL. We employ powerful LLMs trained on publicly accessible data to synthesize high-quality, domain-specific data that resembles user typing data without accessing any private user data. This approach involves carefully designed prompts to instruct LLMs to filter large public datasets and select text characteristic of user typing data.
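A minimal sketch of this prompt-based filtering follows. The prompt wording is invented for illustration, and the LLM judge is stubbed with a simple heuristic so the sketch runs offline; in the paper’s setting, a public LLM would score each candidate:

```python
# Hypothetical prompt; the actual prompts in the paper are carefully designed.
FILTER_PROMPT = (
    "Does the following text resemble something a person would type on a "
    "mobile keyboard? Answer yes or no.\n\nText: {text}"
)

def llm_says_yes(prompt):
    """Stand-in for a public LLM call (heuristic stub, not a real API)."""
    text = prompt.split("Text: ", 1)[1]
    # Treat short, non-shouting text as typing-like for this sketch.
    return len(text.split()) <= 12 and not text.isupper()

def filter_public_corpus(corpus):
    """Keep public sentences the (stubbed) LLM judges to be typing-like."""
    return [t for t in corpus if llm_says_yes(FILTER_PROMPT.format(text=t))]

public_corpus = [
    "ok see you at 7",
    "THE QUARTERLY REPORT",
    "this formal sentence keeps going on and on well beyond twelve words in total",
]
typing_like = filter_public_corpus(public_corpus)
```

Because only public text ever enters the filter, the resulting corpus resembles user typing data without any private data being read.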

Real-World Impact

Our methods have made a real-world impact in mobile typing applications. By adhering to privacy principles and leveraging synthetic data, we’ve delivered substantial user benefits and improved production models. This dedication to privacy-preserving learning has not only enhanced user experiences but also bridged the gap between small and large models through synthetic data.

Case Study: Gboard’s Proofreading Feature

One notable example is Gboard’s proofreading feature, powered by LLMs. This feature uses synthetic data to correct errors in real-time, providing users with a seamless and accurate typing experience. The synthetic data ensures that the model can learn from a diverse range of text without compromising user privacy.

FAQs

How does synthetic data improve model performance?

Synthetic data improves model performance by providing high-quality, domain-specific data that mimics user interaction data. This data is generated using LLMs trained on publicly accessible data, ensuring that the models can learn from a diverse range of text without accessing any private user data.

What are the privacy risks associated with traditional LM training?

Traditional LM training systems pose potential privacy risks, such as the memorization of sensitive user instruction data. Privacy-preserving synthetic data provides a path to access user interaction data to improve models while systematically minimizing these risks.

How does federated learning with differential privacy work?

Federated learning with differential privacy (DP-FL) ensures that user data stored on users’ own devices has minimal exposure during training and is not memorized by the trained models. Combined with pre-training on public or synthetic data, this approach enables the deployment of user-level DP in production.

What are the benefits of using synthetic data in mobile typing applications?

Using synthetic data in mobile typing applications offers several benefits, including improved model performance, enhanced user experiences, and the ability to bridge the gap between small and large models. Additionally, synthetic data ensures that models can learn from a diverse range of text without compromising user privacy.

Conclusion

In conclusion, privacy-preserving synthetic data in federated learning is a transformative approach that improves both small and large language models while minimizing privacy risks. At AI News Daily, we’re committed to staying at the forefront of AI advancements, and we’re glad to share our exploration and insights into this field. As we continue to innovate, we’ll keep you updated on the latest developments and their real-world applications.


