Two numbers from VaultGemma

Why differentially-private model releases need pre-registered evaluation protocols

James Dreben · 2026-05-19 · revised 2026-05-21 · earlier versions

In September 2025, Google released VaultGemma 1B, a from-scratch differentially-private large language model. The technical report says the model was trained with DP-SGD to satisfy ε ≤ 2.0 and δ = 1.1e-10 at the sequence level (1024-token sequences, zeroing-out adjacency, PLD accountant). Under Google’s discoverable-extraction memorization test (prompt the model with 50 tokens from a training document, check whether it produces the next 50 tokens, repeat across roughly a million prompts), they reported no detectable memorization.
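As a rough illustration of the discoverable-extraction protocol described above, here is a minimal Python sketch. It assumes documents are already tokenized into lists of integer IDs, and `generate` is a hypothetical callable standing in for greedy decoding from the model. None of these names come from the VaultGemma report, and the real test's sampling and matching rules may differ.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical stand-in for model decoding: (prompt tokens, n) -> n new tokens.
Generate = Callable[[Sequence[int], int], List[int]]


def is_discoverably_extractable(
    doc: Sequence[int],
    generate: Generate,
    prefix_len: int = 50,
    suffix_len: int = 50,
) -> bool:
    """True if prompting with the first prefix_len tokens reproduces the next suffix_len exactly."""
    if len(doc) < prefix_len + suffix_len:
        return False  # too short to form a (prefix, suffix) pair
    prefix = doc[:prefix_len]
    target = list(doc[prefix_len:prefix_len + suffix_len])
    return list(generate(prefix, suffix_len)) == target


def extraction_rate(
    docs: Iterable[Sequence[int]],
    generate: Generate,
    prefix_len: int = 50,
    suffix_len: int = 50,
) -> float:
    """Fraction of eligible documents whose suffix is reproduced verbatim."""
    eligible = 0
    hits = 0
    for doc in docs:
        if len(doc) < prefix_len + suffix_len:
            continue  # skip documents that cannot be tested
        eligible += 1
        if is_discoverably_extractable(doc, generate, prefix_len, suffix_len):
            hits += 1
    return hits / eligible if eligible else 0.0
```

The result depends on how `docs` is sampled. Drawing documents uniformly from the training set and selecting adversarially chosen targets are different protocols even when the matching rule is identical.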

A workshop paper out of the University of Illinois (Diwan, Wang, Alabi, TPDP 2026) ran a different protocol. They constructed adversarial prompts against about 15,000 well-specified, nontrivial, frequent Pile sequences and reported 7.6% exact and 12.6% approximate memorization. They also ran a 200-query untargeted PII probe and recovered externally verifiable real personal information in 2 of those 200 queries. The authors are explicit that the PII finding is suggestive rather than calibrated memorization evidence: confirming memorization would require training-data access they do not have.

The two numbers, 0% memorization from Google’s discoverable-extraction test and 7.6% exact memorization from Illinois’s adversarial-extraction test, describe the same model. Both can be right because the protocols are not measuring the same thing. A null result from discoverable extraction with uniform sampling is consistent with VaultGemma’s sequence-level DP claim; adversarial extraction tests a strictly stronger empirical claim, and the workshop paper itself frames its result as evidence that DP-SGD reduces but does not eliminate memorization in that worst-case sense. Neither result falsifies the formal ε ≤ 2 guarantee. The disagreement is about which empirical claim a publisher considers binding, and neither party committed in advance to a shared standard.

What is missing is pre-specification. Before the model was released, there was no public protocol saying which empirical test would carry evidentiary weight.

This post is not really about VaultGemma. It is about a missing piece in the way the field discloses differentially-private AI.

The claim is narrow. DP model releases should separate three things that are often blurred together: the formal privacy claim, the empirical tests the publisher specified before seeing results, and later audits by outsiders. Pre-registration is a way to make that separation durable.

The practical payoff is better comparison. When a later memorization result appears, readers can see whether it tested a protocol the publisher had already said would matter.

The protocol-selection risk

Clinical trials have a familiar version of this. Picture a trial in which the researcher gets to choose, after looking at the data, which endpoint to report. “We had twelve outcome measures, four showed improvement, here are those four.” This is the structure that pre-registration regimes exist to prevent. The FDA requires it. Reputable medical journals require it. Experimental psychology has been moving toward it for the better part of a decade. You commit to the protocol before you look at the data, or your result does not count.

The DP-in-AI analogue is direct. A publisher who runs a memorization test, sees a clean result, and reports it has not done anything wrong, but they also have not made a binding commitment. An outside auditor who runs a different test and gets a different result has also not done anything wrong. The disagreement is not adjudicable. Press releases duel press releases.

That is the gap this post is about: not bad faith, and not a shortage of clever privacy tests, but the absence of a shared record saying which empirical tests were specified in advance.

A short orientation for readers who don’t live in differential privacy

Skip this if you do.

Differential privacy is a mathematical guarantee about how much any one person’s data can change the behaviour of an algorithm that ran on a dataset including that person. The parameter ε (epsilon) controls how much; smaller is stricter. There is no context-free good value, but as rough calibration, ε ≤ 1 is usually treated as strong while much larger values require careful justification. The 2020 U.S. Census redistricting data used a total ε of about 19.6, which was widely criticised as too weak. The DP guarantee is a precise theorem about a randomised algorithm, in this setting the training procedure. A trained model is the output of that algorithm.

For ε to mean anything externally, it has to come with the rest of the machinery that makes it interpretable. Which unit is being protected (a person, a document, a 1024-token sequence)? What counts as “two databases differing in one record” (the neighbouring relation)? How was the privacy loss tracked across training steps (the accountant)? What part of training does the guarantee cover (the boundary)? VaultGemma is unusually careful about most of this. Many DP releases are not.
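For readers who want the formal statement, the pieces named above slot into the standard definition of (ε, δ)-differential privacy. The symbols M, D, D′ and S are notation introduced here for illustration, not taken from the VaultGemma report.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
\quad \text{for all neighbouring datasets } D, D' \text{ and all measurable output sets } S.
```

Here M is the randomised training algorithm, and D, D′ are datasets that differ in one protected unit under the chosen neighbouring relation. Changing the unit or the neighbouring relation changes which pairs (D, D′) the inequality quantifies over, which is why the same ε can mean very different things across releases.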

Beyond the formal DP claim, publishers often also report empirical privacy results: “we ran this memorization test and found nothing.” These are useful. They sometimes catch implementation bugs the math cannot. They are also slippery. There is no shared standard for which test to run, what counts as a finding, or which protocols a publisher considers binding.

Where the auditing literature fits

The DP literature has spent the last several years adding new auditing techniques. Steinke, Nasr, and Jagielski’s Privacy Auditing with One (1) Training Run (NeurIPS 2023) showed that empirical lower bounds on ε can be extracted from a single DP-SGD training run. Panda, Tang, Nasr, Choquette-Choo, and Mittal’s Privacy Auditing of Large Language Models (ICLR 2025) sharpened canary-based membership inference for LLMs specifically. Cebere, Even, Bleistein, and Bellet’s zero-run work (2026) is starting to open the post-hoc-observational setting in which the auditor has no training-time access at all.
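To give a feel for what "an empirical lower bound on ε" means, here is a minimal Python sketch based on the standard hypothesis-testing view of (ε, δ)-DP, under which any membership-inference attack must satisfy TPR ≤ e^ε · FPR + δ and, symmetrically, TNR ≤ e^ε · FNR + δ. This is not the one-run procedure from Steinke et al. or the canary method from Panda et al. It is a naive point estimate with no confidence interval, and the function name is made up for illustration.

```python
import math


def epsilon_point_estimate(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Naive estimate of the smallest epsilon consistent with an attack's observed rates.

    Uses TPR <= exp(eps) * FPR + delta and the symmetric TNR <= exp(eps) * FNR + delta.
    Assumption: tpr and fpr are treated as exact; a real audit would replace them
    with confidence bounds before taking logarithms.
    """
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("tpr and fpr must lie in [0, 1]")
    candidates = [0.0]  # epsilon is never negative
    if tpr - delta > 0:
        # A zero false-positive rate makes the raw ratio unbounded.
        candidates.append(math.inf if fpr == 0 else math.log((tpr - delta) / fpr))
    tnr, fnr = 1.0 - fpr, 1.0 - tpr
    if tnr - delta > 0:
        candidates.append(math.inf if fnr == 0 else math.log((tnr - delta) / fnr))
    return max(candidates)
```

An attack that does no better than guessing (TPR equal to FPR) yields 0, which is no evidence either way. The infinite values that appear when an observed error rate is exactly zero are one reason published audits work with confidence bounds rather than raw rates.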

These are real advances. None of them answers the question of which test the publisher actually committed to. Until that question has an answer, even good audits are easy to talk past, and the public has no stable way to distinguish a careful DP release from a careless one.

The proposal: pre-registration

The structured-disclosure piece of this is not new. OpenDP’s Deployment Registry already collects standardized DP claims and parameters, and NIST SP 800-226 lays out the disclosure expectations behind that work. What pre-registration adds, on top of that infrastructure, is a binding empirical-protocols section: a record of which tests the publisher considered meaningful evidence of their privacy claim before seeing results.

At release time, the publisher signs a structured disclosure that includes three things.

First, the formal DP claim, with all the parameters that make it interpretable.

Second, the implementation details required to check the math: clipping norm, sampling rate, gradient-normalization convention. (Wang et al. in a recent preprint argue that the last one is exactly where most DP-SGD implementations mismatch their analyses, which is a separate failure mode but one the signed disclosure would surface.)

Third, one or more empirical protocols the publisher considers binding tests of named privacy claims, each with full metadata: threat model, query budget, decision threshold, and, where the protocol supports it, the statistical method that maps audit output to a privacy bound.

An outside auditor who runs a pre-registered protocol emits a second signed document, a probe receipt, that hash-binds to the publisher’s disclosure. A verifier renders the chain.
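One way to picture the three-part disclosure and the hash-binding is the Python sketch below. Every class and field name is hypothetical rather than the proposal's actual schema, the example values are placeholders, and signing is left out entirely: the sketch only shows a probe receipt committing to the SHA-256 digest of a canonical JSON encoding of the disclosure.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Protocol:
    name: str
    threat_model: str
    query_budget: int
    decision_threshold: str
    result_class: str  # "leakage-evidence" or "epsilon-lower-bound"


@dataclass(frozen=True)
class Disclosure:
    model: str
    formal_claim: Dict[str, str]    # epsilon, delta, unit, neighbouring relation, accountant
    implementation: Dict[str, str]  # clipping norm, sampling rate, normalization convention
    protocols: List[Protocol]       # tests the publisher treats as binding


@dataclass(frozen=True)
class ProbeReceipt:
    disclosure_sha256: str  # hash-binds the receipt to one exact disclosure
    protocol_name: str
    outcome: str


def digest(disclosure: Disclosure) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding."""
    canonical = json.dumps(asdict(disclosure), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def binds_to(receipt: ProbeReceipt, disclosure: Disclosure) -> bool:
    """True if the receipt names this exact disclosure and one of its registered protocols."""
    registered = {p.name for p in disclosure.protocols}
    return receipt.disclosure_sha256 == digest(disclosure) and receipt.protocol_name in registered


# Placeholder values for illustration only.
disclosure = Disclosure(
    model="example-dp-model",
    formal_claim={"epsilon": "<= 2.0", "unit": "1024-token sequence"},
    implementation={"clipping_norm": "1.0"},
    protocols=[
        Protocol(
            name="discoverable-extraction",
            threat_model="50-token prefix prompts from training data",
            query_budget=1_000_000,
            decision_threshold="exact 50-token suffix match",
            result_class="leakage-evidence",
        )
    ],
)
receipt = ProbeReceipt(digest(disclosure), "discoverable-extraction", "no matches observed")
print(binds_to(receipt, disclosure))
```

Because the digest covers the protocol metadata, editing the threat model or decision threshold after the fact produces a different hash, and earlier receipts stop binding to the altered disclosure.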

The verifier never says “passed.” For a memorization or extraction probe, it says things like “claim readable; no probe report attached” or “leakage evidence at protocol P under published claim ε ≤ 2.0.” For a canary or one-run audit with a stated audit theorem, it can say “lower bound ε̂ ≥ 1.4 ± 0.2 against published claim ε ≤ 2.0.” The two result classes are not interchangeable. Readers judge.
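The rendering rule can be sketched in a few lines of Python. The type and function names below are hypothetical, and the wording for a probe that finds nothing is my own placeholder; the other strings follow the examples in the paragraph above. The point of the sketch is that the two result classes take separate code paths and no path emits a verdict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    protocol: str
    result_class: str                  # "leakage-evidence" or "epsilon-lower-bound"
    leakage_found: bool = False        # used only for leakage-evidence results
    eps_lower: Optional[float] = None  # used only for epsilon-lower-bound results


def render(claim: str, result: Optional[ProbeResult]) -> str:
    """Describe the evidence chain without issuing a verdict."""
    if result is None:
        return "claim readable; no probe report attached"
    if result.result_class == "leakage-evidence":
        # Extraction and memorization probes never yield a bound on epsilon.
        if result.leakage_found:
            return f"leakage evidence at protocol {result.protocol} under published claim {claim}"
        return f"no leakage evidence at protocol {result.protocol} under published claim {claim}"
    if result.result_class == "epsilon-lower-bound":
        if result.eps_lower is None:
            raise ValueError("epsilon-lower-bound result needs eps_lower")
        return f"lower bound ε̂ ≥ {result.eps_lower} against published claim {claim}"
    raise ValueError(f"unknown result class: {result.result_class}")
```

Rejecting unknown result classes, rather than falling back to a default message, keeps a leakage finding from being silently rendered as if it were a bound.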

The mechanism is technique-agnostic. Steinke’s one-run audit becomes a pre-registerable protocol for pre-release auditing when the publisher cooperates at training time. Panda’s LLM canary becomes a pre-registerable protocol for DP-fine-tuning releases that allow training-time canary insertion. Both require publisher cooperation; neither is a post-hoc external audit. New techniques register as they mature. The framework does not compete with the auditing literature. It gives that literature the missing binding instrument.

What this looks like on VaultGemma

I worked through a model-release receipt for VaultGemma under the proposed schema, using only public materials. The companion note is here. Three things came out of it.

Six disclosure-grade fields are missing from VaultGemma’s public materials: the L2 clipping norm, the gradient-normalization convention, the Poisson sampling rate, the pinned version of Google’s DP-accounting library, the HuggingFace revision hash for the released weights, and a version-pinning field for any probe surface. None of these are research-grade asks. None require changing how VaultGemma was trained. Their absence prevents an outside expert from checking the accounting today.
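A completeness check over those six fields is simple to express. The Python sketch below uses hypothetical key names that I chose for illustration; they are not the proposal's schema, and the check says nothing about whether a supplied value is correct.

```python
from typing import Any, List, Mapping

# Hypothetical key names for the six disclosure-grade fields listed above.
REQUIRED_FIELDS = (
    "l2_clipping_norm",
    "gradient_normalization_convention",
    "poisson_sampling_rate",
    "dp_accounting_library_version",
    "weights_revision_hash",
    "probe_surface_version",
)


def missing_fields(disclosure: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or left empty."""
    return [name for name in REQUIRED_FIELDS if disclosure.get(name) in (None, "")]


# A disclosure that states only the headline parameters is missing all six.
headline_only = {"epsilon": 2.0, "delta": 1.1e-10}
print(missing_fields(headline_only))
```

A check like this only makes gaps visible. Whether a disclosed clipping norm or sampling rate actually matches the training run is still a job for an accounting review.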

One protocol is honestly pre-registerable from the public materials: VaultGemma’s own discoverable-extraction test. It would be pre-registered as leakage-evidence, not as formal-audit-lower-bound. Memorization probes are not DP audits. They catch some kinds of bad behaviour, but they do not produce a bound on ε. The receipt makes that distinction explicit instead of letting it slide.

No mature post-hoc external auditing protocol exists for from-scratch DP pretraining. Steinke’s one-run audit and Panda’s LLM canary both require training-time cooperation: the auditor needs to be able to insert audit examples before training begins. Cebere’s zero-run work is research-track. This is not a defect of VaultGemma’s disclosure. It is a property of the field as it stands in May 2026. The receipt’s job is to make that gap visible, not paper over it.

The practical implication is straightforward. If VaultGemma’s authors had pre-registered an adversarial-extraction protocol alongside their own discoverable-extraction one, the Illinois workshop result would now be a binding piece of evidence against a broader “no detectable memorization” claim under the adversarial protocol. Both numbers would still be true, but neither side could dismiss the other on protocol-choice grounds. There would be a durable, shared standard for the field to build on.

That is what pre-registration buys.

What this is not

A few clarifications, because DP is a field where careless framing gets pushed back on (correctly) very fast.

This is not a certification scheme. A signed receipt records what the publisher claimed; it does not say the claim is true. It is not a privacy proof. Canary and one-run probe receipts can produce empirical lower bounds on ε under stated audit theorems. Extraction and memorization probe receipts produce leakage evidence, not bounds. Neither produces an upper bound on ε; only the formal analysis does that. It is not a privacy score, because the verifier displays a chain rather than a number. It is not a substitute for accounting review. The receipt makes the inputs to a competent accountant legible. The accountant still has to do the work.

It is also not a redefinition of “private.” The formal DP claim remains a claim about a randomised algorithm. The empirical protocols produce evidence about that claim, never a substitute for it.

What I’d want to hear from you

This is a draft. I am preparing to circulate it more widely, and the framing has to survive serious scrutiny before that. These are the specific points where I would find pushback useful.

Is “pre-registration” the right name? It signals correctly to anyone familiar with clinical trials or experimental psychology. It might land less well with engineers who associate “registration” with a different set of things. Alternatives I considered while drafting: protocol binding, declared-protocol disclosure. I am not committed.

Is the auditing literature mature enough to make pre-registration informative today? The pre-release operating point is well served by Steinke 2023 and Panda 2025. The external post-hoc operating point on from-scratch DP pretraining is largely empty. If that operating point stays empty, the proposal degrades to “structured disclosure with a memorization-probe slot,” which is useful but smaller than I’d like.

What would actually get a publisher to sign one of these? Regulatory pressure, peer convention, downstream-deployer demand, the example of OpenDP’s Deployment Registry filling up over time? Some combination? Something else?

Where am I wrong about the literature? I’ve been re-engaging with DP after roughly a decade away from academia, and I expect to be wrong about parts of what’s current. The reading list is in the proposal’s sources section.

The full proposal: Pre-registered empirical privacy for DP model releases. The VaultGemma completion: VaultGemma receipt completion under the pre-registration proposal.

Reachable on Mastodon at @Jdreben@mastodon.world. I want this engaged with, not endorsed.

Updates

2026-05-21: revisions for clarity and to add credit to OpenDP and NIST. The originally published version is preserved under earlier versions.