Pre-registered empirical privacy for DP model releases

James Dreben · 2026-05-19 · revised 2026-05-21 · Draft for community review

This is the proposal behind “Two numbers from VaultGemma.” It is intentionally smaller than a full specification. The point is to separate three things that are easy to blur together:

the formal differential-privacy claim;
the empirical tests the publisher specified before seeing results;
later audits or probes run by outside parties.

The claim is not that pre-registration proves privacy. It does not. The claim is that pre-registration makes empirical privacy evidence easier to interpret.

The disclosure gap

Differentially-private model releases usually make two kinds of claim.

The first is formal: under stated assumptions, the training mechanism satisfies some version of (ε, δ)-DP. That claim depends on the privacy unit, neighboring relation, mechanism, accountant, and composition boundary. Without those details, an epsilon value cannot be checked or compared with the epsilon of another release.
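For reference, the standard (ε, δ)-DP definition shows why those details matter. The guarantee quantifies over pairs of neighboring datasets, so it says nothing until the privacy unit and neighboring relation are fixed. The symbols M (the training mechanism), D and D' (neighboring datasets), and S (a measurable set of outputs) are notation introduced here, not taken from any particular release.

```latex
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all measurable } S .
\]
```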

The second is empirical: the publisher ran some memorization, extraction, or membership-inference test and reports what happened. Those results can be useful. They can also be hard to compare if the protocol was chosen after seeing the model.

VaultGemma is the motivating example. Google reported no detectable memorization under its discoverable-extraction test. Diwan, Wang, and Alabi later reported memorization under a more adversarial extraction protocol. Both results may be true. The missing piece is a public record of which empirical protocols the publisher considered binding in advance.

The proposal

At release time, the publisher signs a structured receipt with three sections.

1. Formal claim. The receipt records the DP definition, epsilon, delta, privacy unit, neighboring relation, mechanism, accountant, and claim boundary.

2. Reproducibility details. For DP-SGD releases, this includes implementation details such as clipping norm, sampling model, sampling rate, gradient-normalization convention, accountant library, and pinned artifact version. These are not exotic requests. They are the facts an outside expert needs before checking the accounting.

3. Pre-specified empirical protocols. The publisher declares one or more protocols it considers meaningful evidence for named privacy claims. Each protocol should state the threat model, sample construction, query budget, decision threshold, and, where applicable, the statistical method that maps audit output to an empirical privacy bound.
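The three sections above can be sketched as a data structure. This is a minimal illustration, not a schema: the field names, the JSON canonicalization, and the SHA-256 digest are all assumptions, the signing step is left out, and every value is a placeholder rather than a figure from a real release.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FormalClaim:
    definition: str            # e.g. "(epsilon, delta)-DP"
    epsilon: float
    delta: float
    privacy_unit: str
    neighboring_relation: str
    mechanism: str
    accountant: str
    claim_boundary: str


@dataclass(frozen=True)
class EmpiricalProtocol:
    name: str
    version: str
    threat_model: str
    sample_construction: str
    query_budget: int
    decision_threshold: str
    statistical_method: Optional[str] = None  # only where applicable


@dataclass(frozen=True)
class PublisherReceipt:
    artifact_version: str
    formal_claim: FormalClaim
    reproducibility: Dict[str, str]   # clipping norm, sampling rate, ...
    protocols: List[EmpiricalProtocol]


def receipt_digest(receipt: PublisherReceipt) -> str:
    """Hash a canonical JSON form; a real system would sign this digest."""
    canonical = json.dumps(asdict(receipt), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values only (hypothetical release).
receipt = PublisherReceipt(
    artifact_version="example-model@placeholder",
    formal_claim=FormalClaim(
        definition="(epsilon, delta)-DP",
        epsilon=1.0,
        delta=1e-6,
        privacy_unit="placeholder unit",
        neighboring_relation="placeholder relation",
        mechanism="DP-SGD",
        accountant="placeholder accountant",
        claim_boundary="placeholder boundary",
    ),
    reproducibility={"clipping_norm": "placeholder", "sampling_rate": "placeholder"},
    protocols=[
        EmpiricalProtocol(
            name="placeholder-protocol",
            version="0",
            threat_model="placeholder",
            sample_construction="placeholder",
            query_budget=0,
            decision_threshold="placeholder",
        )
    ],
)
digest = receipt_digest(receipt)

# Changing any claimed value changes the digest a later audit would point at.
changed = replace(receipt, formal_claim=replace(receipt.formal_claim, epsilon=2.0))
changed_digest = receipt_digest(changed)
```

Hashing a canonical form gives later documents something fixed to point at. A claim cannot be edited quietly after an audit without the reference breaking.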

An auditor who runs a declared protocol emits a second signed document: a probe receipt. It identifies the publisher receipt, states the protocol version, records deviations, and reports the result class: formal-audit-lower-bound, leakage-evidence, inconclusive, or not-applicable.
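A probe receipt could be sketched along the same lines. The four result classes come from the text above. The class and field names, and the idea of pointing at the publisher receipt by digest, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ResultClass(Enum):
    FORMAL_AUDIT_LOWER_BOUND = "formal-audit-lower-bound"
    LEAKAGE_EVIDENCE = "leakage-evidence"
    INCONCLUSIVE = "inconclusive"
    NOT_APPLICABLE = "not-applicable"


@dataclass(frozen=True)
class ProbeReceipt:
    publisher_receipt_digest: str   # identifies the publisher receipt (hypothetical field)
    protocol_name: str
    protocol_version: str
    result_class: ResultClass
    deviations: List[str] = field(default_factory=list)


def parse_result_class(label: str) -> ResultClass:
    """Accept only the four declared labels; anything else raises ValueError."""
    return ResultClass(label)


# Placeholder values only.
probe = ProbeReceipt(
    publisher_receipt_digest="0" * 64,
    protocol_name="placeholder-protocol",
    protocol_version="0",
    result_class=parse_result_class("leakage-evidence"),
    deviations=["placeholder deviation"],
)

try:
    parse_result_class("passed")
    rejected = False
except ValueError:
    rejected = True
```

A closed set of result classes means a probe receipt cannot report a bare "passed". It has to say which kind of evidence it carries.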

The verifier displays the chain. It should not say “passed.” It should say what was claimed, what protocol was run, what evidence was attached, and what assumptions limit the result.
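A verifier display along these lines might be sketched as a function that renders the chain as text and has no verdict field. The function name, the dictionary keys, and the output layout are hypothetical. The example values are placeholders.

```python
from typing import List


def render_chain(claim: dict, probe: dict, assumptions: List[str]) -> str:
    """Render what was claimed, what was run, and what limits the result."""
    deviations = probe.get("deviations") or ["none recorded"]
    lines = [
        "Claimed: " + claim["summary"],
        "Protocol run: {} (version {})".format(probe["protocol"], probe["version"]),
        "Deviations: " + "; ".join(deviations),
        "Evidence: result class " + probe["result_class"],
        "Assumptions limiting the result: " + "; ".join(assumptions),
    ]
    return "\n".join(lines)


# Placeholder values only.
report = render_chain(
    claim={"summary": "(epsilon, delta)-DP, epsilon=1.0, delta=1e-06, privacy unit: placeholder"},
    probe={
        "protocol": "placeholder-protocol",
        "version": "0",
        "result_class": "leakage-evidence",
        "deviations": [],
    },
    assumptions=["placeholder assumption"],
)
print(report)
```

Leaving out a pass/fail field is deliberate. The reader gets the claim, the protocol, the evidence, and the limits, and draws the conclusion.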

How this relates to current auditing work

This proposal does not compete with privacy-auditing research. It gives that research a disclosure instrument.

Steinke, Nasr, and Jagielski’s one-run audit is a candidate pre-release protocol when the publisher cooperates before training.
Panda, Tang, Nasr, Choquette-Choo, and Mittal’s LLM canary work is a candidate protocol for settings where canaries can be inserted before training or fine-tuning.
Discoverable extraction is useful leakage evidence, but it is not a formal DP audit and should not be described as one.
Zero-run and post-hoc observational audits are promising, but still depend on assumptions that need to be stated plainly before they can bind a public claim.

The important distinction is result class. A memorization finding may matter a lot, but it is not automatically a formal DP violation.

What this is not

This is not a certification scheme. A signed receipt records what the publisher claimed; it does not make the claim true.

It is not a privacy proof. Empirical audits can produce evidence and, for some protocols, empirical lower-bound estimates. They cannot produce the publisher’s formal upper bound.
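To show what an empirical lower bound means, here is a minimal sketch using the standard hypothesis-testing view of (ε, δ)-DP. Any membership test against a mechanism satisfying the guarantee has TPR ≤ e^ε · FPR + δ, so observed attack rates imply ε ≥ ln((TPR − δ) / FPR). The function name is hypothetical, and the sketch uses point estimates for simplicity. A real audit would replace them with confidence bounds (a lower bound on TPR, an upper bound on FPR), so this is not a statistically valid audit on its own.

```python
import math


def epsilon_lower_bound(tpr: float, fpr: float, delta: float) -> float:
    """Empirical lower bound on epsilon implied by an attack's TPR and FPR.

    Uses TPR <= exp(eps) * FPR + delta. Returns 0.0 when the attack does
    no better than the trivial bound.
    """
    if not (0.0 < fpr <= 1.0 and 0.0 <= tpr <= 1.0 and 0.0 <= delta < 1.0):
        raise ValueError("need 0 < fpr <= 1, 0 <= tpr <= 1, 0 <= delta < 1")
    if tpr - delta <= fpr:
        return 0.0
    return math.log((tpr - delta) / fpr)


# Illustrative rates only, not results from any real audit.
no_signal = epsilon_lower_bound(tpr=0.5, fpr=0.5, delta=0.0)
some_signal = epsilon_lower_bound(tpr=0.2, fpr=0.1, delta=0.0)
```

The output is a floor on epsilon, never a ceiling. If the floor sits above the claimed epsilon, that contradicts the formal claim, subject to the protocol's assumptions. If it sits below, nothing has been confirmed.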

It is not a privacy score. A coverage summary may be useful, but it should remain a list of surfaces and protocols, not a single number.

It is not a substitute for accounting review. The receipt makes the inputs legible. The review still has to be done.

Open questions

Several questions remain open.

Name. “Pre-registration” is clear to readers familiar with clinical trials or experimental psychology. It may sound wrong to engineers. “Protocol binding” or “declared-protocol disclosure” may be better.

Post-hoc auditing. The strongest formal audits for DP-SGD still need training-time cooperation. For from-scratch DP pretraining with only released weights available, an outside auditor has far fewer protocols to choose from, and those protocols support much weaker conclusions.

Adoption. The mechanism is useful if even one publisher signs a receipt. Broader uptake would require pressure from deployers, reviewers, regulators, or peer convention.

Synthetic-data chains. DP-SGD generator → synthetic dataset → downstream model is a chain of claims. Each stage likely needs its own receipt and its own assumptions.

Sources

Sinha et al., “VaultGemma: A Differentially Private Gemma Model” (arXiv:2510.15001).
Google Research, “VaultGemma: The world’s most capable differentially private LLM” (blog post, 2025).
Diwan, Wang, Alabi, “Extractable Memorization of Differentially Private Large Language Model” (TPDP 2026 workshop note).
Steinke, Nasr, Jagielski, “Privacy Auditing with One (1) Training Run” (arXiv:2305.08846).
Panda, Tang, Nasr, Choquette-Choo, Mittal, “Privacy Auditing of Large Language Models” (arXiv:2503.06808).
Cebere, Even, Bleistein, Bellet, “Privacy Auditing with Zero (0) Training Run” (arXiv:2605.14591).
Wang et al., “Rethinking the Security of DP-SGD: A Corrected Analysis of Differentially Private Machine Learning” (arXiv:2605.15648).
Nanayakkara, Ghazi, Vadhan, “OpenDP Deployment Registry” (arXiv:2509.13509; registry.opendp.org).
Near et al., NIST SP 800-226, “Guidelines for Evaluating Differential Privacy Guarantees” (NIST, 2025).