Large machine learning (ML) models are ubiquitous in modern applications: from spam filters to recommender systems and virtual assistants. These models achieve remarkable performance partially due to the abundance of available training data. However, these data can sometimes contain private information, including personal identifiable information, copyright material, etc. Therefore, protecting the privacy of the training data is critical to practical, applied ML.
Differential Privacy (DP) is one of the most widely accepted technologies that allows reasoning about data anonymization in a formal way. In the context of an ML model, DP can guarantee that each individual user’s contribution will not result in a significantly different model. A model’s privacy guarantees are characterized by a tuple (ε, δ), where smaller values of both represent stronger DP guarantees and better privacy.
While there are successful examples of protecting training data using DP, obtaining good utility with differentially private ML (DP-ML) techniques can be challenging. First, there are inherent privacy/computation tradeoffs that may limit a model’s utility. Further, DP-ML models often require architectural and hyperparameter tuning, and guidelines on how to do this effectively are limited or difficult to find. Finally, non-rigorous privacy reporting makes it challenging to compare and choose the best DP methods.
In “How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy”, to appear in the Journal of Artificial Intelligence Research, we discuss the current state of DP-ML research. We provide an overview of common techniques for obtaining DP-ML models and discuss research, engineering challenges, mitigation techniques and current open questions. We will present tutorials based on this work at ICML 2023 and KDD 2023.
DP can be introduced during the ML model development process in three places: (1) at the input data level, (2) during training, or (3) at inference. Each option provides privacy protections at different stages of the ML development process, with the weakest being when DP is introduced at the prediction level and the strongest being when introduced at the input level. Making the input data differentially private means that any model that is trained on this data will also have DP guarantees. When introducing DP during the training, only that particular model has DP guarantees. DP at the prediction level means that only the model’s predictions are protected, but the model itself is not differentially private.
The task of introducing DP gets progressively easier from the left to right.
DP is commonly introduced during training (DP-training). Gradient noise injection methods, like DP-SGD or DP-FTRL, and their extensions are currently the most practical methods for achieving DP guarantees in complex models like large deep neural networks.
DP-SGD builds off of the stochastic gradient descent (SGD) optimizer with two modifications: (1) per-example gradients are clipped to a certain norm to limit sensitivity (the influence of an individual example on the overall model), which is a slow and computationally intensive process, and (2) a noisy gradient update is formed by taking aggregated gradients and adding noise that is proportional to the sensitivity and the strength of privacy guarantees.
DP-SGD is a modification of SGD that involves a) clipping per-example gradients to limit the sensitivity and b) adding the noise, calibrated to the sensitivity and privacy guarantees, to the aggregated gradients, before the gradient update step.
Existing DP-training challenges
Gradient noise injection methods usually exhibit: (1) loss of utility, (2) slower training, and (3) an increased memory footprint.
Loss of utility:
The best method for reducing utility drop is to use more computation. Using larger batch sizes and/or more iterations is one of the most prominent and practical ways of improving a model’s performance. Hyperparameter tuning is also extremely important but often overlooked. The utility of DP-trained models is sensitive to the total amount of noise added, which depends on hyperparameters, like the clipping norm and batch size. Additionally, other hyperparameters like the learning rate should be re-tuned to account for noisy gradient updates.
Another option is to obtain more data or use public data of similar distribution. This can be done by leveraging publicly available checkpoints, like ResNet or T5, and fine-tuning them using private data.
Most gradient noise injection methods limit sensitivity via clipping per-example gradients, considerably slowing down backpropagation. This can be addressed by choosing an efficient DP framework that efficiently implements per-example clipping.
Increased memory footprint:
DP-training requires significant memory for computing and storing per-example gradients. Additionally, it requires significantly larger batches to obtain better utility. Increasing the computation resources (e.g., the number and size of accelerators) is the simplest solution for extra memory requirements. Alternatively, several works advocate for gradient accumulation where smaller batches are combined to simulate a larger batch before the gradient update is applied. Further, some algorithms (e.g., ghost clipping, which is based on this paper) avoid per-example gradient clipping altogether.
The following best practices can attain rigorous DP guarantees with the best model utility possible.
Choosing the right privacy unit:
First, we should be clear about a model’s privacy guarantees. This is encoded by selecting the “privacy unit,” which represents the neighboring dataset concept (i.e., datasets where only one row is different). Example-level protection is a common choice in the research literature, but may not be ideal, however, for user-generated data if individual users contributed multiple records to the training dataset. For such a case, user-level protection might be more appropriate. For text and sequence data, the choice of the unit is harder since in most applications individual training examples are not aligned to the semantic meaning embedded in the text.
Choosing privacy guarantees:
We outline three broad tiers of privacy guarantees and encourage practitioners to choose the lowest possible tier below:
Tier 1 — Strong privacy guarantees: Choosing ε ≤ 1 provides a strong privacy guarantee, but frequently results in a significant utility drop for large models and thus may only be feasible for smaller models.
Tier 2 — Reasonable privacy guarantees: We advocate for the currently undocumented, but still widely used, goal for DP-ML models to achieve an ε ≤ 10.
Tier 3 — Weak privacy guarantees: Any finite ε is an improvement over a model with no formal privacy guarantee. However, for ε > 10, the DP guarantee alone cannot be taken as sufficient evidence of data anonymization, and additional measures (e.g., empirical privacy auditing) may be necessary to ensure the model protects user data.
Choosing hyperparameters requires optimizing over three inter-dependent objectives: 1) model utility, 2) privacy cost ε, and 3) computation cost. Common strategies take two of the three as constraints, and focus on optimizing the third. We provide methods that will maximize the utility with a limited number of trials, e.g., tuning with privacy and computation constraints.
Reporting privacy guarantees:
A lot of works on DP for ML report only ε and possibly δ values for their training procedure. However, we believe that practitioners should provide a comprehensive overview of model guarantees that includes:
DP setting: Are the results assuming central DP with a trusted service provider, local DP, or some other setting?
Instantiating the DP definition:
Data accesses covered: Whether the DP guarantee applies (only) to a single training run or also covers hyperparameter tuning etc.
Final mechanism’s output: What is covered by the privacy guarantees and can be released publicly (e.g., model checkpoints, the full sequence of privatized gradients, etc.)
Unit of privacy: The selected “privacy unit” (example-level, user-level, etc.)
Adjacency definition for DP “neighboring” datasets: A description of how neighboring datasets differ (e.g., add-or-remove, replace-one, zero-out-one).
Privacy accounting details: Providing accounting details, e.g., composition and amplification, are important for proper comparison between methods and should include:
Type of accounting used, e.g., Rényi DP-based accounting, PLD accounting, etc.
Accounting assumptions and whether they hold (e.g., Poisson sampling was assumed for privacy amplification but data shuffling was used in training).
Formal DP statement for the model and tuning process (e.g., the specific ε, δ-DP or ρ-zCDP values).
Transparency and verifiability: When possible, complete open-source code using standard DP libraries for the key mechanism implementation and accounting components.
Paying attention to all the components used:
Usually, DP-training is a straightforward application of DP-SGD or other algorithms. However, some components or losses that are often used in ML models (e.g., contrastive losses, graph neural network layers) should be examined to ensure privacy guarantees are not violated.
While DP-ML is an active research area, we highlight the broad areas where there is room for improvement.
Developing better accounting methods:
Our current understanding of DP-training ε, δ guarantees relies on a number of techniques, like Rényi DP composition and privacy amplification. We believe that better accounting methods for existing algorithms will demonstrate that DP guarantees for ML models are actually better than expected.
Developing better algorithms:
The computational burden of using gradient noise injection for DP-training comes from the need to use larger batches and limit per-example sensitivity. Developing methods that can use smaller batches or identifying other ways (apart from per-example clipping) to limit the sensitivity would be a breakthrough for DP-ML.
Better optimization techniques:
Directly applying the same DP-SGD recipe is believed to be suboptimal for adaptive optimizers because the noise added to privatize the gradient may accumulate in learning rate computation. Designing theoretically grounded DP adaptive optimizers remains an active research topic. Another potential direction is to better understand the surface of DP loss, since for standard (non-DP) ML models flatter regions have been shown to generalize better.
Identifying architectures that are more robust to noise:
There’s an opportunity to better understand whether we need to adjust the architecture of an existing model when introducing DP.
Our survey paper summarizes the current research related to making ML models DP, and provides practical tips on how to achieve the best privacy-utility trade offs. Our hope is that this work will serve as a reference point for the practitioners who want to effectively apply DP to complex ML models.
We thank Hussein Hazimeh, Zheng Xu , Carson Denison , H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien and Abhradeep Thakurta, Badih Ghazi, Chiyuan Zhang for the help preparing this blog post, paper and tutorials content. Thanks to John Guilyard for creating the graphics in this post, and Ravi Kumar for comments.