Pre-registered empirical privacy for DP model releases
A proposal.
Summary
Differentially-private model releases (DP-SGD-trained LLMs, DP synthetic datasets, downstream models trained on DP synthetic data) make two kinds of privacy claim. The formal claim is a theorem. Under stated assumptions, the training mechanism satisfies (ε, δ)-DP. The empirical claim is whatever the publisher says about memorization, extraction, or membership-inference resistance, usually accompanied by a chosen protocol and a result.
The formal claim has tooling: Opacus, dp-accounting, JAX
Privacy, the RDP/PRV/GDP accountants. The empirical claim does not.
Today, any publisher can report whatever empirical privacy result they
want, against whatever protocol they want, with whatever query budget
they want. An outside auditor running a different protocol produces a
different number, and there is no structural way to say which one bound
the claim.
This is not hypothetical. Google reported no detectable memorization on VaultGemma’s discoverable-extraction test (Sinha et al., 2025). A workshop analysis (Diwan, Wang, Alabi, 2026) ran a different prompting strategy and reported 7.6% exact and 12.6% approximate memorization on 15k targeted Pile sequences, plus externally verifiable PII in 1% of 200-query untargeted cases. The two parties disagree, and no shared protocol exists against which either can be held to account.
The proposal is pre-registered empirical privacy. At release time, the publisher signs a receipt that declares the empirical protocols under which their claim is meant to be falsifiable, with full protocol metadata. Outside auditors who run those protocols emit signed probe receipts that bind by hash to the publisher receipt. The verifier renders the chain.
This does not certify DP. It does not invent a new auditing technique. It makes the existing techniques produce numbers the publisher cannot dismiss post-hoc as “you used the wrong test.”
The problem, stated precisely
Three failure modes recur across DP model releases.
The first is disclosure incompleteness. Many published privacy claims do not contain enough information to evaluate them. “DP-trained, ε ≤ 2” is not a privacy claim without privacy unit, neighboring relation, accounting method, and training boundary. The OpenDP Deployment Registry (Nanayakkara, Ghazi, Vadhan, 2025) addresses this for aggregate releases by specifying a deployment-card schema. Model releases are not covered.
The second is protocol shopping. When an empirical privacy result is published without a pre-committed protocol, the publisher is free to choose the protocol that gives the favourable result, and the auditor is free to choose the protocol that exposes the unfavourable one. Neither result is a binding statement. This is structurally identical to the problem that pre-registration addressed in clinical trials and experimental psychology, where the response was to require protocol declaration before the experiment is run.
The third is accounting opacity. DP-SGD accounting depends on implementation choices that are usually buried in code or prose: noisy-clipped-gradient normalization (expected vs. sampled batch size), Poisson vs. fixed-batch sampling with truncation, padding policy, gradient accumulation semantics. Wang et al. (2026) argue that common analyses mismatch common implementations on these exact points. A receipt that does not surface these fields is not auditable even by a cooperating outside expert.
Problems one and three are addressed by a structured receipt schema. Problem two is the new contribution and the focus of this proposal.
The pre-registration mechanism
A pre-registered empirical privacy claim has three artifacts.
A. Publisher receipt (signed at release time)
A model-release receipt extending the OpenDP Deployment Registry deployment-card schema. Required sections.
- subject: model identity, artifact digest for open weights, release channel, publisher.
- claim: dp_definition, epsilon, delta, privacy_unit (with explicit taxonomy: user / document / sequence / event), neighboring_relation, claim_boundary (pretraining / fine_tuning / post_training / synthetic_generation / downstream_training / chain).
- data_boundary: data source description, unit construction, contribution bound or null-with-reason, user mapping.
- mechanism: mechanism_type (e.g., dp_sgd), clipping_norm, noise_multiplier, sampling_model, batch_handling, gradient_normalization, sampling_rate or fields sufficient to derive it, steps, gradient_accumulation semantics.
- accounting: accountant family, library, and version; subsampling_amplification_assumption; theorem or method citation; composition_scope; delta_rationale; rounding/reporting policy.
- pre_registered_protocols: the new section. Zero or more protocol declarations, each containing all fields needed to reproduce the audit (described below).
- probe_surface: surface_type (open_weights / deterministic_api / stochastic_api / synthetic_dataset / downstream_model / none), version_pinning, rate_limits, randomness_controls, logging_or_policy_constraints.
- signature: canonical serialization, signature algorithm, verifying key, publisher key identity, signature bytes.
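As a rough illustration of how these sections might fit together, the sketch below builds a skeletal receipt as a plain Python dictionary, checks that every required section is present, and computes a digest over a canonical serialization. The serialization rule (JSON with sorted keys), the sha256: prefix, and every field value are assumptions made for illustration; the proposal does not fix a canonical format here, and the signing step is omitted.

```python
import hashlib
import json

REQUIRED_SECTIONS = (
    "subject", "claim", "data_boundary", "mechanism", "accounting",
    "pre_registered_protocols", "probe_surface", "signature",
)

def missing_sections(receipt: dict) -> list:
    """Return the required top-level sections absent from the receipt."""
    return [name for name in REQUIRED_SECTIONS if name not in receipt]

def receipt_digest(receipt: dict) -> str:
    """Hash the receipt body (everything except the signature section).

    Assumed canonical form: JSON, sorted keys, no insignificant whitespace.
    """
    body = {k: v for k, v in receipt.items() if k != "signature"}
    blob = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()

# All values below are placeholders, not taken from any real release.
example_receipt = {
    "subject": {"model": "example-dp-model", "artifact_digest": None,
                "release_channel": "model_hub", "publisher": "Example Lab"},
    "claim": {"dp_definition": "approx_dp", "epsilon": 2.0, "delta": 1e-10,
              "privacy_unit": "sequence", "neighboring_relation": "add_remove",
              "claim_boundary": "pretraining"},
    "data_boundary": {"contribution_bound": None,
                      "null_reason": "unbounded_corpus",
                      "user_mapping": "not_user_level"},
    "mechanism": {"mechanism_type": "dp_sgd", "clipping_norm": 1.0,
                  "noise_multiplier": 0.6, "sampling_model": "poisson"},
    "accounting": {"accountant": "prv", "composition_scope": "single_run"},
    "pre_registered_protocols": [],
    "probe_surface": {"surface_type": "open_weights"},
    "signature": None,  # signing is out of scope for this sketch
}

print(missing_sections(example_receipt))
print(receipt_digest(example_receipt))
```

Excluding the signature section from the hashed body is one possible design choice: it lets the digest be computed before signing and keeps it stable if the signature encoding changes.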
A reference receipt profile draft lives in the modelreceipt project’s working docs (private repository). The VaultGemma completion exercise demonstrates the schema on a real release.
B. Pre-registered protocol declarations
Each declaration in pre_registered_protocols is a signed
commitment that the publisher considers the protocol an acceptable
falsification surface for their claim. Required fields.
- protocol_id: canonical identifier (e.g., panda-iclr-2025-llm-canary, steinke-neurips-2023-one-run, vaultgemma-discoverable-extraction).
- protocol_version: stable version of the protocol specification.
- threat_model: attacker capabilities, what they know, what they observe.
- attacker_knowledge: training data access, candidate-set construction.
- sample_construction: how the audit examples are produced. For canary protocols, where canaries come from and how they are inserted or filtered.
- query_budget: maximum number of queries the auditor may issue.
- decision_threshold: what counts as falsification. For canary inference: TPR at a target FPR, or empirical-ε lower bound above a threshold.
- lower_bound_method: how raw audit outputs map to an empirical lower bound on ε (e.g., the Steinke-Jagielski tight-auditing analysis, the Jagielski-Ullman-Oprea hypothesis-test bound).
- acceptable_score_functions: which membership-inference statistics the publisher will accept (loss, logit margin, perplexity-ratio, gradient-norm proxy).
- excluded_post_processing: any decoding parameters, system prompts, retrieval layers, or safety filters whose absence is required for the audit to bind to the training mechanism rather than the deployed pipeline.
- expected_lower_bound: optional. The publisher’s own prediction of what this protocol should produce against their model. A pre-declared expectation the auditor can falsify.
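To make decision_threshold and lower_bound_method concrete, the sketch below converts an attack's operating point into a point estimate of empirical ε using the standard hypothesis-testing inequality for (ε, δ)-DP, TPR ≤ e^ε · FPR + δ. It is a deliberately simplified illustration: a binding lower bound would replace the observed rates with confidence bounds as prescribed by the protocol's stated lower_bound_method. The function names and example numbers are hypothetical.

```python
import math

def epsilon_point_estimate(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Point estimate of the ε implied by an attack's (FPR, TPR) operating point.

    Rearranges TPR <= exp(ε) * FPR + δ into ε >= ln((TPR - δ) / FPR).
    Uses observed rates only; it is not a statistically valid lower bound.
    """
    if not 0.0 < fpr <= 1.0:
        raise ValueError("fpr must be in (0, 1]")
    if not 0.0 <= tpr <= 1.0:
        raise ValueError("tpr must be in [0, 1]")
    if tpr - delta <= 0.0:
        return 0.0  # no evidence of leakage at this operating point
    return max(0.0, math.log((tpr - delta) / fpr))

def exceeds_threshold(tpr: float, fpr: float, delta: float,
                      epsilon_threshold: float) -> bool:
    """Hypothetical decision rule: does the estimate exceed the declared threshold?"""
    return epsilon_point_estimate(tpr, fpr, delta) > epsilon_threshold

# Hypothetical canary-inference result: 50% TPR at 10% FPR.
estimate = epsilon_point_estimate(tpr=0.5, fpr=0.1, delta=1e-6)
print(round(estimate, 3))
print(exceeds_threshold(0.5, 0.1, 1e-6, epsilon_threshold=2.0))
```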
A publisher may pre-register multiple protocols, at least one of
which should be appropriate to each disclosed probe surface. A receipt
with probe_surface.surface_type = open_weights but no
pre_registered_protocols appropriate to open-weight access
is, by construction, not probe-ready.
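The probe-readiness rule can be sketched as a small check. The applicable_surfaces field on each protocol declaration is a hypothetical addition used here to record which probe surfaces a protocol suits; the schema above does not name such a field, so treat it as an assumption.

```python
def is_probe_ready(receipt: dict) -> bool:
    """True when at least one pre-registered protocol suits the declared surface."""
    surface = receipt["probe_surface"]["surface_type"]
    if surface == "none":
        return False  # publisher has declared no external probe surface
    protocols = receipt.get("pre_registered_protocols", [])
    # applicable_surfaces is a hypothetical field, not part of the stated schema.
    return any(surface in p.get("applicable_surfaces", []) for p in protocols)

# Placeholder receipts, reduced to the fields this check reads.
open_weights_no_protocols = {
    "probe_surface": {"surface_type": "open_weights"},
    "pre_registered_protocols": [],
}
open_weights_with_protocol = {
    "probe_surface": {"surface_type": "open_weights"},
    "pre_registered_protocols": [
        {"protocol_id": "vaultgemma-discoverable-extraction",
         "applicable_surfaces": ["open_weights"]},
    ],
}

print(is_probe_ready(open_weights_no_protocols))
print(is_probe_ready(open_weights_with_protocol))
```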
C. Probe receipts (signed by the auditor)
When an outsider runs a pre-registered protocol, they emit a second signed receipt with:
- audits_receipt: hash of the publisher’s receipt being audited. This is the binding link. It is signed by the auditor.
- protocol_id and protocol_version: must match a pre_registered_protocols entry in the publisher receipt by content, not just by name.
- protocol_deviations: any deviation from the pre-registered protocol. Auditors who deviate make their deviation legible. The verifier can refuse to treat the result as bound.
- execution_record: query count, timestamps, randomness seeds where applicable, environment.
- result: one of formal-audit-lower-bound, leakage-evidence, inconclusive, not-applicable. The formal-audit-lower-bound output is the only one that maps to an empirical ε. The others are evidence objects with carefully scoped meaning.
- lower_bound_value: when applicable, the empirical-ε lower bound and its confidence interval, computed via the protocol’s stated lower_bound_method.
- auditor_signature.
The verifier displays the chain: publisher receipt, pre-registered protocols, the set of probe receipts that bind to it. The verifier never says “passed.” It says “claim readable; no probe report attached,” or “formal audit lower bound attached at protocol P, 1.4 ± 0.2 against claim ε ≤ 2.0,” and lets the reader judge.
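A minimal sketch of the binding check a verifier might perform is given below. It assumes a JSON-based digest and a hypothetical protocol_digest field on the probe receipt that carries the hash of the protocol declaration the auditor ran, which is one way to implement matching "by content, not just by name." Signature verification is left out.

```python
import hashlib
import json

def digest(obj) -> str:
    """Assumed canonical hash: SHA-256 over sorted-key, compact JSON."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def binding_status(publisher_receipt: dict, probe_receipt: dict) -> str:
    """Describe whether a probe receipt binds to a publisher receipt."""
    if probe_receipt["audits_receipt"] != digest(publisher_receipt):
        return "unbound: audits_receipt does not match publisher receipt"
    declared = {
        (p["protocol_id"], p["protocol_version"]): p
        for p in publisher_receipt["pre_registered_protocols"]
    }
    key = (probe_receipt["protocol_id"], probe_receipt["protocol_version"])
    if key not in declared:
        return "unbound: protocol not pre-registered"
    # protocol_digest is a hypothetical field used for the content match.
    if probe_receipt["protocol_digest"] != digest(declared[key]):
        return "unbound: protocol content differs from declaration"
    if probe_receipt.get("protocol_deviations"):
        return "deviations declared: reader must judge whether the result binds"
    return "bound"

# Placeholder receipts with illustrative values only.
publisher = {
    "claim": {"epsilon": 2.0},
    "pre_registered_protocols": [
        {"protocol_id": "example-canary", "protocol_version": "1.0",
         "query_budget": 10000},
    ],
}
probe = {
    "audits_receipt": digest(publisher),
    "protocol_id": "example-canary",
    "protocol_version": "1.0",
    "protocol_digest": digest(publisher["pre_registered_protocols"][0]),
    "protocol_deviations": [],
    "result": "inconclusive",
}

print(binding_status(publisher, probe))
```

The function returns a description rather than a pass/fail flag, in keeping with the verifier's role of rendering the chain and leaving judgment to the reader.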
Why this composes with current auditing work
The proposal does not replace any existing auditing technique. It provides the binding instrument that lets those techniques produce externally credible numbers.
| Auditing technique | Role under pre-registration |
|---|---|
| Jagielski, Ullman, Oprea (NeurIPS 2020) | Canary-based audit; available as a registerable protocol for fine-tuning audits. |
| Nasr et al. (USENIX 2023) | Tight auditing; pre-registerable for publisher-side audits with controlled access. |
| Steinke, Nasr, Jagielski (NeurIPS 2023) | One-run audit; the canonical pre-release pre-registerable protocol. |
| Panda, Tang, Nasr, Choquette-Choo, Mittal (ICLR 2025) | LLM canary; the canonical external pre-registerable protocol for open-weight DP-LLMs. |
| Cebere, Even, Bleistein, Bellet (arXiv 2605.14591, 2026) | Zero-run observational; research track until confounding-correction assumptions stabilize enough to bind. |
| Discoverable extraction (VaultGemma technical report, 2025) | Practical leakage probe; pre-registerable as a leakage-evidence protocol, not a formal-audit-lower-bound. |
| González et al. (NeurIPS 2025), sequential auditing | Pre-registerable when the publisher commits to the sequential design and stopping rule. |
The framework is technique-agnostic. As new auditing methods mature, they become available as registerable protocols. Older protocols remain in the registry, and publishers who registered them remain bound to them.
What this is not
A signed receipt does not say the formal DP claim is true. It says the publisher committed to it in a way an outside party can hold them to. That is the distinction between binding disclosure and certification, and the receipt sits firmly on the disclosure side.
A probe report with result: leakage-evidence is not a DP
violation. A probe report with
result: formal-audit-lower-bound is a lower bound, not a
refutation of the publisher’s upper bound, unless the two are
inconsistent.
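The relationship between an empirical lower bound and the publisher's claimed upper bound can be sketched as a comparison. The rule used here, treating the two as inconsistent only when the lower end of the confidence interval exceeds the claimed ε, is an assumed convention for illustration, and it presupposes that both numbers refer to the same δ, privacy unit, and neighboring relation.

```python
def compare_to_claim(claimed_epsilon: float, lower_bound: float,
                     ci_half_width: float) -> str:
    """Compare an empirical-ε lower bound against the claimed upper bound.

    Assumes a symmetric confidence interval given as a half-width.
    """
    if ci_half_width < 0:
        raise ValueError("ci_half_width must be non-negative")
    conservative_bound = lower_bound - ci_half_width
    if conservative_bound > claimed_epsilon:
        return "inconsistent with claim"
    return "consistent with claim"

# Illustrative numbers: a lower bound of 1.4 ± 0.2 against a claim of ε ≤ 2.0.
print(compare_to_claim(claimed_epsilon=2.0, lower_bound=1.4, ci_half_width=0.2))
# A hypothetical lower bound of 2.5 ± 0.2 against the same claim.
print(compare_to_claim(claimed_epsilon=2.0, lower_bound=2.5, ci_half_width=0.2))
```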
The verifier displays a chain, not a number. There is no “privacy score.” A coverage list (which probe surfaces have pre-registered protocols, and which do not) is the closest the verifier comes to summarising, and that list is qualitative.
The receipt does not replace accounting review. It structures the inputs to a competent accountant so the review is possible. The accountant still has to do the work.
The mechanism is not a denial-of-service vector for probe surfaces.
Rate limits, randomness controls, and query budgets are receipt fields.
A publisher who cannot tolerate any external querying declares
probe_surface.surface_type = none and registers no
protocols. The verifier then displays “no external probe available,”
which is itself useful disclosure.
The proposal is not a redefinition of differential privacy.
Pre-registration is a protocol about empirical evidence regarding a DP
claim. The formal claim remains a claim about a randomized algorithm.
The fields dp_definition,
neighboring_relation, and privacy_unit are
first-class on the receipt precisely so that the empirical evidence is
interpreted against the right claim.
Anticipated objections
Each objection is paired with the kind of reader most likely to raise it. The pairings are illustrative, not predictive.
“You are letting ‘private’ mean something other than DP.”
The receipt’s claim section is the formal DP claim. The
pre-registered protocols produce evidence about that claim, not
a substitute for it. The verifier never outputs “private.” The
result field of a probe receipt is one of
formal-audit-lower-bound, leakage-evidence,
inconclusive, or not-applicable. Empirical
lower bounds are reported as lower bounds with confidence intervals.
Leakage evidence is reported as evidence, not as violation.
The construction that protects this boundary is the requirement that
every empirical claim in a probe receipt cites the protocol’s
lower_bound_method, the specific theorem or technique
mapping audit output to a statistical statement about DP. A probe
receipt with no lower_bound_method may carry leakage-evidence, but it
cannot carry a formal-audit-lower-bound.
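This boundary rule lends itself to a mechanical check. The sketch below is one possible validator over the probe-receipt fields named above; the function name and error strings are illustrative, not part of any specified verifier.

```python
ALLOWED_RESULTS = {
    "formal-audit-lower-bound", "leakage-evidence",
    "inconclusive", "not-applicable",
}

def result_errors(probe: dict) -> list:
    """Return rule violations in a probe receipt's result fields."""
    errors = []
    result = probe.get("result")
    if result not in ALLOWED_RESULTS:
        errors.append(f"unknown result: {result!r}")
    if result == "formal-audit-lower-bound":
        # A formal lower bound must cite how it was derived and carry a value.
        if not probe.get("lower_bound_method"):
            errors.append("formal-audit-lower-bound requires lower_bound_method")
        if probe.get("lower_bound_value") is None:
            errors.append("formal-audit-lower-bound requires lower_bound_value")
    elif probe.get("lower_bound_value") is not None:
        # Only the formal result type maps to an empirical ε.
        errors.append("only formal-audit-lower-bound may carry lower_bound_value")
    return errors

# Leakage evidence without a lower_bound_method is acceptable.
print(result_errors({"result": "leakage-evidence"}))
# A formal lower bound without a method is not.
print(result_errors({"result": "formal-audit-lower-bound",
                     "lower_bound_value": 1.4}))
```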
“This duplicates or competes with the OpenDP Deployment Registry.”
It is an extension, not a sibling. The subject,
claim, data_boundary, and
accounting sections inherit from the registry schema. The
mechanism section extends it with DP-SGD-specific fields
the aggregate-release registry does not need. The new sections are
pre_registered_protocols and probe_surface.
The natural upstream home for the extension is
opendp/deployments-registry-data. The modelreceipt
project’s role is the extension proposal and the reference verifier.
“Where does this fit in the SP 800-226 evaluation pyramid?”
The receipt maps to the pyramid layer-for-layer. subject
is artifact identity. data_boundary is data collection.
privacy_unit and neighboring_relation are unit
of privacy. mechanism is algorithms and correctness.
accounting is privacy parameters.
probe_surface and pre_registered_protocols
together are the auditing layer the framework’s §2.7 calls for at the
model-release operating point. Just as an aggregate-release
verifier addresses the data-release-system audit gap at the
scalar-statistic operating point, this proposal addresses it at the
model-release operating point.
“Publishers will manipulate the pre-registration.”
Several manipulation surfaces are worth considering up front.
A publisher could pre-register a protocol no one can actually run (say, by specifying private data the outside auditor cannot access). Mitigation: the verifier requires the protocol’s input artifacts to be public, derivable from the model artifact itself, or producible from a publicly described construction.
A publisher could pre-register a protocol with a query budget too
small to find anything. Mitigation: the verifier surfaces
query_budget prominently. Readers can judge. The receipt
cannot fully solve this problem, but it makes the budget legible, which
is exactly the kind of fact a competent reader can evaluate.
A publisher could pre-register many weak protocols and omit the
strong ones. Mitigation: the verifier lists the protocols present and
absent. Consensus reference protocols
(panda-iclr-2025-llm-canary,
steinke-neurips-2023-one-run) become socially expected, and
their absence is conspicuous.
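A sketch of how a verifier might surface this is shown below: it lists each declared protocol with its query budget and names the reference protocols that are missing. The reference set is taken from the two identifiers mentioned above; treating exactly those two as the consensus set is an assumption.

```python
REFERENCE_PROTOCOLS = (
    "panda-iclr-2025-llm-canary",
    "steinke-neurips-2023-one-run",
)

def protocol_coverage(receipt: dict, reference=REFERENCE_PROTOCOLS) -> dict:
    """List declared protocols (with budgets) and absent reference protocols."""
    declared = {
        p["protocol_id"]: p
        for p in receipt.get("pre_registered_protocols", [])
    }
    return {
        "present": [
            {"protocol_id": pid, "query_budget": p.get("query_budget")}
            for pid, p in sorted(declared.items())
        ],
        "absent_reference": [pid for pid in reference if pid not in declared],
    }

# Placeholder receipt: one leakage probe registered, with a small budget.
receipt = {
    "pre_registered_protocols": [
        {"protocol_id": "vaultgemma-discoverable-extraction",
         "query_budget": 200},
    ],
}

coverage = protocol_coverage(receipt)
for entry in coverage["present"]:
    print("present:", entry["protocol_id"], "budget:", entry["query_budget"])
for pid in coverage["absent_reference"]:
    print("absent:", pid)
```

The output is a pair of lists rather than a score, so the reader sees what was and was not registered without the verifier ranking it.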
The proposal does not eliminate adversarial behaviour by publishers. It makes adversarial behaviour legible.
“Pre-registering a protocol does not fix the audit’s statistical properties.”
True. The pre-registration mechanism is orthogonal to whether any
particular audit produces a tight lower bound. The receipt-encoded
lower_bound_method is the audit’s own statistical
machinery. The pre-registration only protects against post-hoc protocol
shopping. A weak protocol pre-registered is still weak. The receipt
makes this legible by recording the protocol’s expected operating point
and forcing the auditor’s result to be reported with its CI.
“No publisher will sign one of these.”
The hardest objection. Two responses.
First, the marginal cost is small. Google already publishes the technical substrate (the VaultGemma report). The marginal cost of pre-registering one empirical protocol is a signed paragraph identifying the protocol, its threat model, and the lower-bound method.
Second, adoption is a downstream question. The contribution of this proposal is to specify what would constitute a binding disclosure. Whether any specific publisher adopts it is a separate matter, dependent on regulator attention, peer pressure, and downstream-deployer demand (the same demand that drove the OpenDP Registry to about 40 entries in its first year). The proposal succeeds intellectually if it specifies the right thing, even if broad adoption requires a forcing function.
What would falsify the proposal
This is the discipline check.
1. No available audit protocol can be both pre-registered and informative for the relevant operating points.
Two operating points need protocol coverage, and the available techniques cover them unevenly.
Pre-release (publisher binds before training). Steinke 2023’s one-run audit and Panda 2025’s LLM canary both apply, both require training-time inclusion-randomization of audit examples, and both produce empirical-ε lower bounds with stated statistical analyses. This operating point has mature pre-registerable protocols today.
External post-hoc (auditor has only the released artifact). Today, the only protocols that do not require training-time cooperation are discoverable extraction (memorization probes) and post-hoc membership inference on documents the auditor can prove were in the training corpus. The first is leakage-evidence, not a formal-audit-lower-bound. The second is weaker than canary insertion and confounded by distribution shift between member and non-member documents. The Cebere 2026 zero-run line attempts to correct for the confounding, but its assumptions are not yet mature enough to bind. The honest reading is that no formal-audit-lower-bound protocol exists for the pure external post-hoc setting on a from-scratch DP pretraining release.
This unevenness is itself informative about the proposal’s value. A
VaultGemma receipt today, completed honestly, would say: probe surface =
open weights; pre-registered protocols available for external auditing =
discoverable extraction only, as leakage-evidence. That
disclosure makes the gap legible rather than papering over it.
Publishers who want external formal-audit-lower-bound
coverage must either cooperate with a third-party auditor in a
pre-release setting that supports canary insertion, or wait for the
post-hoc-auditing literature (Cebere et al. and follow-ons) to produce
protocols with stable confounding-correction assumptions.
The falsification test is therefore narrower than “run Panda on VaultGemma.” It is: complete a VaultGemma receipt against the proposal, confirm that the available pre-registerable protocols correctly describe the operating envelope, and confirm that what the receipt cannot claim is itself useful disclosure. That is feasible without training-time access. The pre-release operating point can be validated independently on a synthetic DP-fine-tuning case where the audit team controls the training run.
2. The receipt is so heavy that no publisher would adopt even informally.
Counter-test: take the VaultGemma technical report and complete a
receipt. If the gaps are minor (artifact digest, clipping norm) the
receipt is tractable. If the gaps are structural (no contribution bound
possible, no neighboring relation declared), the receipt exposes a real
disclosure problem in the underlying release, which is the right
outcome. The companion VaultGemma
completion is the first pass at this test. Result: VaultGemma’s
public materials yield a claim-readable receipt today, with
named gaps. That is acceptable. The receipt’s purpose is to make the
gaps namable.
3. The pre-registration framing is rejected by the DP community on conceptual grounds.
This is the most important falsification path and the reason this document exists as a draft for community review. Reading and circulation come first. Code follows the conceptual frame, not the other way around.
Open questions
Several questions remain genuinely open.
Chain receipts. A DP-SGD generator → DP synthetic dataset → downstream model trained on synthetic data is a chain. Each stage has different receipt requirements. How does pre-registration apply at chain boundaries? The likely answer: each stage has its own receipt, each binds by hash to upstream, the verifier renders the chain. The composition assumptions become first-class fields. Pre-registered protocols at downstream stages may only test downstream behaviour, not the composed guarantee.
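The "likely answer" can be sketched as a hash-linked list of stage receipts. The upstream_receipt field name and the JSON-based digest are assumptions made for illustration; the check below only confirms that each stage points at the exact upstream receipt it was issued against and says nothing about whether the composed guarantee holds.

```python
import hashlib
import json

def digest(obj) -> str:
    """Assumed canonical hash: SHA-256 over sorted-key, compact JSON."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def chain_breaks(stages: list) -> list:
    """Return indices of stages whose upstream link fails.

    stages is ordered upstream to downstream; index 0 has no upstream.
    """
    breaks = []
    for i in range(1, len(stages)):
        if stages[i].get("upstream_receipt") != digest(stages[i - 1]):
            breaks.append(i)
    return breaks

# Placeholder three-stage chain: generator -> synthetic dataset -> downstream model.
generator = {"claim_boundary": "fine_tuning", "epsilon": 2.0}
dataset = {"claim_boundary": "synthetic_generation",
           "upstream_receipt": digest(generator)}
downstream = {"claim_boundary": "downstream_training",
              "upstream_receipt": digest(dataset)}

print(chain_breaks([generator, dataset, downstream]))
```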
Unbounded-corpus pretraining. VaultGemma has no meaningful
contribution bound because the population is “the web.” Sequence-level
DP is the publisher’s response. The receipt must allow this case without
papering over it. A contribution_bound: null paired with a
structured null_reason and a prominent user-mapping
disclosure (user_mapping: not_user_level) is the candidate
treatment. The verifier should make sequence-level vs. user-level
legible to a non-expert.
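One way to encode the candidate treatment is a check that a null contribution bound always arrives with its explanation, plus a plain-language line for the verifier to display. The wording of the summary strings and the null_reason value are illustrative assumptions.

```python
def data_boundary_errors(data_boundary: dict) -> list:
    """A null contribution_bound must carry null_reason and user_mapping."""
    errors = []
    if data_boundary.get("contribution_bound") is None:
        if not data_boundary.get("null_reason"):
            errors.append("contribution_bound is null but null_reason is missing")
        if not data_boundary.get("user_mapping"):
            errors.append("contribution_bound is null but user_mapping is missing")
    return errors

def unit_summary(privacy_unit: str, user_mapping: str) -> str:
    """Plain-language statement of the privacy unit for non-expert readers."""
    if privacy_unit == "user":
        return "Guarantee is stated per user."
    if user_mapping == "not_user_level":
        return f"Guarantee is stated per {privacy_unit}, not per user."
    return f"Guarantee is stated per {privacy_unit}; user mapping: {user_mapping}."

# Placeholder data_boundary section for an unbounded-corpus release.
boundary = {
    "contribution_bound": None,
    "null_reason": "unbounded_corpus",   # hypothetical reason code
    "user_mapping": "not_user_level",
}

print(data_boundary_errors(boundary))
print(unit_summary("sequence", boundary["user_mapping"]))
```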
Multi-protocol coverage. Should the verifier compute a coverage score (“this receipt pre-registers protocols covering the open_weights surface but not the synthetic_dataset surface”)? The risk is that a coverage score becomes the new privacy score. Surfacing coverage as a list, not a number, is the candidate mitigation.
Protocol versioning under research drift. When
panda-iclr-2025-llm-canary ships v2 with a stronger canary,
do pre-registrations against v1 remain binding? Proposed answer: yes. v1
commitments bind to v1. The publisher may re-register v2 against the
same release. The registry of protocols is append-only.
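The append-only rule can be sketched as a small in-memory registry that accepts new versions but refuses to overwrite an existing one. The class and method names are hypothetical, and a real registry would need persistence and signed entries.

```python
class ProtocolRegistry:
    """Append-only store keyed by (protocol_id, protocol_version)."""

    def __init__(self):
        self._specs = {}

    def register(self, protocol_id: str, version: str, spec: dict) -> None:
        key = (protocol_id, version)
        if key in self._specs:
            # Existing versions are immutable; publish a new version instead.
            raise ValueError(f"{protocol_id} {version} is already registered")
        self._specs[key] = dict(spec)

    def get(self, protocol_id: str, version: str) -> dict:
        return dict(self._specs[(protocol_id, version)])

    def versions(self, protocol_id: str) -> list:
        return sorted(v for (pid, v) in self._specs if pid == protocol_id)

# Placeholder specs; the canary descriptions are illustrative only.
registry = ProtocolRegistry()
registry.register("panda-iclr-2025-llm-canary", "v1", {"canary": "original"})
registry.register("panda-iclr-2025-llm-canary", "v2", {"canary": "stronger"})

# A receipt pre-registered against v1 still resolves to the v1 spec.
print(registry.get("panda-iclr-2025-llm-canary", "v1"))
print(registry.versions("panda-iclr-2025-llm-canary"))
```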
Single-org pre-registration audits. Can a publisher pre-register and run their own audit? Yes, but the probe receipt is then a publisher-side audit report, not an external attestation. The verifier displays the auditor’s identity. Readers judge independence.
Hosted-API binding. A hosted-API release whose backend model swaps
silently breaks every probe receipt. The receipt’s
subject.artifact_digest can be null with reason
hosted_api, but probe_surface.version_pinning
must specify how the auditor knows they are probing the same artifact
(deployment id, API version header, publisher-signed assertion of
stability). This is a usability question more than a definitional
one.
Sources
- Sinha et al., “VaultGemma: A Differentially Private Gemma Model” (arXiv:2510.15001).
- Google Research, “VaultGemma: The world’s most capable differentially private LLM” (blog post, 2025).
- Diwan, Wang, Alabi, “Extractable Memorization of Differentially Private Large Language Model” (TPDP 2026 workshop note).
- Abadi et al., “Deep Learning with Differential Privacy” (CCS 2016).
- Mironov, “Rényi Differential Privacy” (CSF 2017).
- Jagielski, Ullman, Oprea, “Auditing Differentially Private Machine Learning” (NeurIPS 2020).
- Nasr, Hayes, Steinke, Balle, Tramèr, Jagielski, Carlini, Terzis, “Tight Auditing of Differentially Private Machine Learning” (USENIX Security 2023).
- Steinke, Nasr, Jagielski, “Privacy Auditing with One (1) Training Run” (arXiv:2305.08846).
- Panda, Tang, Nasr, Choquette-Choo, Mittal, “Privacy Auditing of Large Language Models” (arXiv:2503.06808).
- Cebere, Even, Bleistein, Bellet, “Privacy Auditing with Zero (0) Training Run” (arXiv:2605.14591).
- González et al., “Sequentially Auditing Differential Privacy” (arXiv:2509.07055).
- Wang et al., “Rethinking the Security of DP-SGD: A Corrected Analysis of Differentially Private Machine Learning” (arXiv:2605.15648).
- Nanayakkara, Ghazi, Vadhan, “OpenDP Deployment Registry” (arXiv:2509.13509; registry.opendp.org).
- Near et al., NIST SP 800-226, “Guidelines for Evaluating Differential Privacy Guarantees” (NIST, 2025).
- Stadler, Oprisanu, Troncoso, “Synthetic Data: Anonymisation Groundhog Day” (USENIX Security 2022).
- Carlini et al., “Extracting Training Data from Large Language Models” (USENIX Security 2021).
- Pre-registration as a methodological norm: Nosek et al., “Promoting an Open Research Culture” (Science 2015); the ClinicalTrials.gov protocol-registration regime (operated by the U.S. National Library of Medicine, with registration requirements enforced by the FDA) as the most mature institutional analogue.