Google Launches VaultGemma: First Privacy-Preserving Large Language Model
Google Research has introduced VaultGemma, its first large language model (LLM) trained with differential privacy to reduce the risk of memorizing sensitive user data. The model is based on the older Gemma 2 architecture and has 1 billion parameters; Google claims it performs comparably to non-private models of similar size. The work addresses concerns about privacy violations that can arise when models are trained on sensitive information.
A team at Google Research has been exploring techniques to make LLMs less likely to 'memorize' the content they are trained on. Although LLM outputs are non-deterministic, models can sometimes regurgitate information from their training data, which poses privacy risks when sensitive content is included. Differential privacy prevents such memorization by introducing calibrated noise during training, but it comes with trade-offs in model accuracy and compute requirements.
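The article does not detail Google's exact training recipe, but the standard way to add calibrated noise during training is DP-SGD: clip each example's gradient to bound its influence, then add Gaussian noise scaled to that bound. A minimal sketch of that idea (function names and the pure-Python representation are illustrative, not Google's implementation):

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng=random):
    """One DP-SGD-style update: clip each example's gradient,
    sum them, add Gaussian noise, then average over the batch."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise scale is calibrated to the clipping bound, so no single
    # example can dominate the update -- this is what limits memorization.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / batch_size for n in noisy]
```

The clipping bounds any one example's contribution, and the noise masks whatever remains, which is also why accuracy and compute costs rise: per-example gradients must be materialized and the noise degrades the signal.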
Until recently, little was understood about how differential privacy affects the scaling laws of AI models. The Google team hypothesized that model performance would primarily be governed by the noise-batch ratio, which compares the amount of injected noise to the size of the training batches. By running experiments across model sizes and noise-batch ratios, the team established a basic understanding of differential-privacy scaling laws. This insight helps developers find an optimal noise-batch ratio that balances privacy and model quality.
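The intuition behind the noise-batch ratio can be shown with a simplified calculation (the exact formulation in Google's scaling-law work may differ; the function below is an illustrative assumption, not their published metric):

```python
def noise_batch_ratio(noise_std, batch_size):
    """Effective noise per training example: a larger batch dilutes
    the same amount of injected noise (simplified illustration)."""
    return noise_std / batch_size

# At a fixed noise level, quadrupling the batch cuts the effective
# per-example noise to a quarter -- so larger batches can buy back
# model quality at the same privacy budget, at extra compute cost.
small_batch = noise_batch_ratio(1.0, 1024)
large_batch = noise_batch_ratio(1.0, 4096)
```

This is why the trade-off is three-way: privacy (more noise), quality (less noise per example), and compute (bigger batches).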
The work has culminated in VaultGemma, which is available now on Hugging Face and Kaggle. According to Google, the model's performance relative to non-private peers of its size shows that privacy protections can be integrated without sacrificing capability, a significant result as tech companies continue to seek ways to balance AI advancement with user privacy protections.