Earlier version. This page is the v1 snapshot of this proposal, captured 2026-05-19 (initial publication). The current version may differ. See the versions index for the full edit history.

Pre-registered empirical privacy for DP model releases

James Dreben · Draft for community review

A proposal.

Summary

Differentially private model releases (DP-SGD-trained LLMs, DP synthetic datasets, downstream models trained on DP synthetic data) make two kinds of privacy claim. The formal claim is a theorem: under stated assumptions, the training mechanism satisfies (ε, δ)-DP. The empirical claim is whatever the publisher says about memorization, extraction, or membership-inference resistance, usually accompanied by a chosen protocol and a result.

The formal claim has tooling: Opacus, dp-accounting, JAX Privacy, the RDP/PRV/GDP accountants. The empirical claim does not. Today, any publisher can report whatever empirical privacy result they want, against whatever protocol they want, with whatever query budget they want. An outside auditor running a different protocol produces a different number, and there is no structural way to say which one binds the claim.

This is not hypothetical. Google reported no detectable memorization on VaultGemma’s discoverable-extraction test (Sinha et al., 2025). A workshop analysis (Diwan, Wang, Alabi, 2026) ran a different prompting strategy and reported 7.6% exact and 12.6% approximate memorization on 15k targeted Pile sequences, plus externally verifiable PII in 1% of 200-query untargeted cases. The two parties disagree, and no shared protocol exists against which either can be held to account.

The proposal is pre-registered empirical privacy. At release time, the publisher signs a receipt that declares the empirical protocols under which their claim is meant to be falsifiable, with full protocol metadata. Outside auditors who run those protocols emit signed probe receipts that bind by hash to the publisher receipt. The verifier renders the chain.

This does not certify DP. It does not invent a new auditing technique. It makes the existing techniques produce numbers the publisher cannot dismiss post-hoc as “you used the wrong test.”

The problem, stated precisely

Three failure modes recur across DP model releases.

The first is disclosure incompleteness. Many published privacy claims do not contain enough information to evaluate them. “DP-trained, ε ≤ 2” is not a privacy claim without privacy unit, neighboring relation, accounting method, and training boundary. The OpenDP Deployment Registry (Nanayakkara, Ghazi, Vadhan, 2025) addresses this for aggregate releases by specifying a deployment-card schema. Model releases are not covered.

The second is protocol shopping. When an empirical privacy result is published without a pre-committed protocol, the publisher chooses the protocol that gives the favourable result, and the auditor chooses the protocol that exposes the unfavourable one. Neither is a binding statement. This is structurally identical to the pre-registration problem that clinical trials and experimental psychology faced. The field’s response was to require protocol declaration before the experiment is run.

The third is accounting opacity. DP-SGD accounting depends on implementation choices that are usually buried in code or prose: noisy-clipped-gradient normalization (expected vs. sampled batch size), Poisson vs. fixed-batch sampling with truncation, padding policy, gradient accumulation semantics. Wang et al. (2026) argue that common analyses mismatch common implementations on these exact points. A receipt that does not surface these fields is not auditable even by a cooperating outside expert.

Problems one and three are addressed by a structured receipt schema. Problem two is the new contribution and the focus of this proposal.

The pre-registration mechanism

A pre-registered empirical privacy claim has three artifacts.

A. Publisher receipt (signed at release time)

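A model-release receipt extending the OpenDP Deployment Registry deployment-card schema. The required sections, as named throughout this proposal, are subject, claim, data_boundary, mechanism, accounting, probe_surface, and pre_registered_protocols.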

A reference receipt profile draft lives in the modelreceipt project’s working docs (private repository). The VaultGemma completion exercise demonstrates the schema on a real release.
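As a rough illustration of a completeness check, the sketch below treats a receipt as a plain dictionary and reports which sections are absent. The section names are the ones this proposal uses elsewhere. The dictionary layout and the function name missing_sections are assumptions made for illustration, not the reference profile.

```python
# Section names as used in this proposal; the flat-dict layout is an assumption.
REQUIRED_SECTIONS = (
    "subject",
    "claim",
    "data_boundary",
    "mechanism",
    "accounting",
    "probe_surface",
    "pre_registered_protocols",
)


def missing_sections(receipt: dict) -> list[str]:
    """Return the required sections that the receipt does not contain."""
    return [name for name in REQUIRED_SECTIONS if name not in receipt]


# A partial receipt: the gaps are named rather than hidden.
draft = {"subject": {}, "claim": {}, "accounting": {}}
print(missing_sections(draft))
```

Returning the list of gaps, instead of a pass/fail flag, matches the proposal's stance that a receipt with named gaps is still useful disclosure.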

B. Pre-registered protocol declarations

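Each declaration in pre_registered_protocols is a signed commitment that the publisher considers the protocol an acceptable falsification surface for their claim. The fields referenced elsewhere in this proposal include the protocol identifier, its threat model, lower_bound_method, query_budget, the protocol’s input artifacts, and its expected operating point.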

A publisher may pre-register multiple protocols, at least one of which should be appropriate to each disclosed probe surface. A receipt with probe_surface.surface_type = open_weights but no pre_registered_protocols appropriate to open-weight access is, by construction, not probe-ready.
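The probe-readiness rule can be sketched as a small check. This is a minimal illustration, assuming each protocol declaration carries a hypothetical applicable_surfaces list; that field name and the function probe_readiness are not part of the proposal's schema. The "no external probe available" case follows the proposal's treatment of surface_type = none.

```python
def probe_readiness(receipt: dict) -> str:
    """Classify a receipt by whether its disclosed surface has a usable protocol."""
    surface = receipt["probe_surface"]["surface_type"]
    if surface == "none":
        return "no external probe available"
    protocols = receipt.get("pre_registered_protocols", [])
    # `applicable_surfaces` is a hypothetical field naming where a protocol can run.
    if any(surface in p.get("applicable_surfaces", []) for p in protocols):
        return "probe-ready"
    return "not probe-ready"


open_weights_no_protocol = {
    "probe_surface": {"surface_type": "open_weights"},
    "pre_registered_protocols": [],
}
print(probe_readiness(open_weights_no_protocol))
```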

C. Probe receipts (signed by the auditor)

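When an outsider runs a pre-registered protocol, they emit a second signed receipt. As described elsewhere in this proposal, it binds by hash to the publisher receipt, identifies the auditor, and carries a result field, the protocol’s lower_bound_method where one applies, and a confidence interval for any reported lower bound.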

The verifier displays the chain: publisher receipt, pre-registered protocols, the set of probe receipts that bind to it. The verifier never says “passed.” It says “claim readable; no probe report attached,” or “leakage evidence attached at protocol P, lower bound 1.4 ± 0.2 against claim ε ≤ 2.0,” and lets the reader judge.
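One way the hash binding and the verifier's display could fit together is sketched below. It assumes receipts are JSON-serializable dictionaries, that the binding is a SHA-256 digest over a sorted-key JSON encoding, and that the probe receipt stores it in a hypothetical publisher_receipt_sha256 field. Signature verification is omitted entirely; a real verifier would check signatures before rendering anything.

```python
import hashlib
import json


def receipt_digest(receipt: dict) -> str:
    """SHA-256 over a deterministic JSON encoding (an assumed canonical form)."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def binds_to(probe: dict, publisher: dict) -> bool:
    """True if the probe receipt names exactly this publisher receipt."""
    return probe.get("publisher_receipt_sha256") == receipt_digest(publisher)


def render_status(publisher: dict, probes: list[dict]) -> list[str]:
    """Describe what is attached to the claim; never emit 'passed'."""
    bound = [p for p in probes if binds_to(p, publisher)]
    if not bound:
        return ["claim readable; no probe report attached"]
    return [f"{p['result']} attached at protocol {p['protocol_id']}" for p in bound]


publisher = {"claim": {"epsilon": 2.0}, "subject": {"name": "example-model"}}
probe = {
    "protocol_id": "discoverable-extraction",  # hypothetical identifier
    "result": "leakage-evidence",
    "publisher_receipt_sha256": receipt_digest(publisher),
}
print(render_status(publisher, [probe]))
```

Because the digest covers the whole publisher receipt, any later edit to the claim produces a different digest and the existing probe receipts no longer bind to it.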

Why this composes with current auditing work

The proposal does not replace any existing auditing technique. It provides the binding instrument that lets those techniques produce externally credible numbers.

| Auditing technique | Role under pre-registration |
| --- | --- |
| Jagielski, Ullman, Oprea (NeurIPS 2020) | Canary-based audit; available as a registerable protocol for fine-tuning audits. |
| Nasr et al. (USENIX 2023) | Tight auditing; pre-registerable for publisher-side audits with controlled access. |
| Steinke, Nasr, Jagielski (NeurIPS 2023) | One-run audit; the canonical pre-release pre-registerable protocol. |
| Panda, Tang, Nasr, Choquette-Choo, Mittal (ICLR 2025) | LLM canary; the canonical external pre-registerable protocol for open-weight DP-LLMs. |
| Cebere, Even, Bleistein, Bellet (arXiv 2605.14591, 2026) | Zero-run observational; research track until confounding-correction assumptions stabilize enough to bind. |
| Discoverable extraction (VaultGemma technical report, 2025) | Practical leakage probe; pre-registerable as a leakage-evidence protocol, not a formal-audit-lower-bound. |
| González et al. (NeurIPS 2025), sequential auditing | Pre-registerable when the publisher commits to the sequential design and stopping rule. |

The framework is technique-agnostic. As new auditing methods mature, they become available as registerable protocols. Older protocols remain in the registry, and publishers who registered them remain bound to them.

What this is not

A signed receipt does not say the formal DP claim is true. It says the publisher committed to it in a way an outside party can hold them to. That is the distinction between binding disclosure and certification, and the receipt sits firmly on the disclosure side.

A probe report with result: leakage-evidence is not a DP violation. A probe report with result: formal-audit-lower-bound is a lower bound, not a refutation of the publisher’s upper bound, unless the two are inconsistent.
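To make "inconsistent" concrete, one possible reading is sketched here: an empirical lower bound contradicts the claimed upper bound only when the lower end of its confidence interval exceeds the claimed ε. This is an assumption about how a reader might compare the two numbers, not a rule the proposal fixes. It ignores δ and treats the interval as symmetric.

```python
def is_inconsistent(claimed_epsilon: float, lower_bound: float, half_width: float) -> bool:
    """True if even the low end of the audit's interval exceeds the claimed epsilon."""
    return lower_bound - half_width > claimed_epsilon


# The verifier example from earlier in this proposal: 1.4 ± 0.2 against ε ≤ 2.0.
print(is_inconsistent(claimed_epsilon=2.0, lower_bound=1.4, half_width=0.2))
```

Under this reading, a lower bound of 1.4 ± 0.2 sits comfortably below a claim of ε ≤ 2.0, so the probe report is evidence about tightness, not a refutation.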

The verifier displays a chain, not a number. There is no “privacy score.” A coverage list (which probe surfaces have pre-registered protocols, and which do not) is the closest the verifier comes to summarising, and that list is qualitative.

The receipt does not replace accounting review. It structures the inputs to a competent accountant so the review is possible. The accountant still has to do the work.

The mechanism is not a denial-of-service vector for probe surfaces. Rate limits, randomness controls, and query budgets are receipt fields. A publisher who cannot tolerate any external querying declares probe_surface.surface_type = none and registers no protocols. The verifier then displays “no external probe available,” which is itself useful disclosure.

The proposal is not a redefinition of differential privacy. Pre-registration is a protocol about empirical evidence regarding a DP claim. The formal claim remains a claim about a randomized algorithm. The fields dp_definition, neighboring_relation, and privacy_unit are first-class on the receipt precisely so that the empirical evidence is interpreted against the right claim.

Anticipated objections

Each objection is paired with the kind of reader most likely to raise it. The pairings are illustrative, not predictive.

“You are letting ‘private’ mean something other than DP.”

The receipt’s claim section is the formal DP claim. The pre-registered protocols produce evidence about that claim, not a substitute for it. The verifier never outputs “private.” The result field of a probe receipt is one of formal-audit-lower-bound, leakage-evidence, inconclusive, or not-applicable. Empirical lower bounds are reported as lower bounds with confidence intervals. Leakage evidence is reported as evidence, not as violation.

The construction that protects this boundary is the requirement that every empirical claim in a probe receipt cites the protocol’s lower_bound_method, the specific theorem or technique mapping audit output to a statistical statement about DP. A probe receipt with no lower_bound_method may carry leakage-evidence, but it cannot carry a formal-audit-lower-bound.
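The boundary rule above can be expressed as a validation step. The sketch below is illustrative: the four result values and lower_bound_method come from this proposal, while the confidence_interval field name and the function probe_result_errors are hypothetical.

```python
ALLOWED_RESULTS = {
    "formal-audit-lower-bound",
    "leakage-evidence",
    "inconclusive",
    "not-applicable",
}


def probe_result_errors(probe: dict) -> list[str]:
    """List the ways a probe receipt's result field breaks the boundary rule."""
    errors = []
    result = probe.get("result")
    if result not in ALLOWED_RESULTS:
        errors.append(f"unknown result: {result!r}")
    if result == "formal-audit-lower-bound":
        if not probe.get("lower_bound_method"):
            errors.append("formal-audit-lower-bound requires lower_bound_method")
        # `confidence_interval` is a hypothetical field name.
        if probe.get("confidence_interval") is None:
            errors.append("formal-audit-lower-bound requires a confidence interval")
    return errors


# Leakage evidence may be reported without a lower-bound method.
print(probe_result_errors({"result": "leakage-evidence"}))
```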

“This duplicates or competes with the OpenDP Deployment Registry.”

It is an extension, not a sibling. The subject, claim, data_boundary, and accounting sections inherit from the registry schema. The mechanism section extends it with DP-SGD-specific fields the aggregate-release registry does not need. The new sections are pre_registered_protocols and probe_surface. The natural upstream home for the extension is opendp/deployments-registry-data. The modelreceipt project’s role is the extension proposal and the reference verifier.

“Where does this fit in the SP 800-226 evaluation pyramid?”

The receipt maps to the pyramid layer-for-layer. subject is artifact identity. data_boundary is data collection. privacy_unit and neighboring_relation are unit of privacy. mechanism is algorithms and correctness. accounting is privacy parameters. probe_surface and pre_registered_protocols together are the auditing layer the framework’s §2.7 calls for at the model-release operating point. Just as an aggregate-release verifier addresses the data-release-system audit gap at the scalar-statistic operating point, this proposal addresses it at the model-release operating point.

“Publishers will manipulate the pre-registration.”

Several manipulation surfaces are worth considering up front.

A publisher could pre-register a protocol no one can actually run (say, by specifying private data the outside auditor cannot access). Mitigation: the verifier requires the protocol’s input artifacts to be public, derivable from the model artifact itself, or producible from a publicly described construction.

A publisher could pre-register a protocol with a query budget too small to find anything. Mitigation: the verifier surfaces query_budget prominently. Readers can judge. The receipt cannot fully solve this problem, but it makes the budget legible, which is exactly the kind of fact a competent reader can evaluate.

A publisher could pre-register many weak protocols and omit the strong ones. Mitigation: the verifier lists the protocols present and absent. Consensus reference protocols (panda-iclr-2025-llm-canary, steinke-neurips-2023-one-run) become socially expected, and their absence is conspicuous.
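The "present and absent" listing might look like the following. The two reference protocol identifiers are the ones named above; treating them as a fixed tuple, and the protocol_id field name, are assumptions for illustration.

```python
# Consensus reference protocols named in this proposal.
REFERENCE_PROTOCOLS = (
    "panda-iclr-2025-llm-canary",
    "steinke-neurips-2023-one-run",
)


def protocol_presence(receipt: dict) -> dict:
    """Report which protocols are registered and which reference ones are missing."""
    registered = {p["protocol_id"] for p in receipt.get("pre_registered_protocols", [])}
    return {
        "present": sorted(registered),
        "absent_reference": [r for r in REFERENCE_PROTOCOLS if r not in registered],
    }


receipt = {"pre_registered_protocols": [{"protocol_id": "steinke-neurips-2023-one-run"}]}
print(protocol_presence(receipt))
```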

The proposal does not eliminate adversarial behaviour by publishers. It makes adversarial behaviour legible.

“Pre-registering a protocol does not fix the audit’s statistical properties.”

True. The pre-registration mechanism is orthogonal to whether any particular audit produces a tight lower bound. The receipt-encoded lower_bound_method is the audit’s own statistical machinery. The pre-registration only protects against post-hoc protocol shopping. A weak protocol pre-registered is still weak. The receipt makes this legible by recording the protocol’s expected operating point and forcing the auditor’s result to be reported with its CI.

“No publisher will sign one of these.”

The hardest objection. Two responses.

First, the marginal cost is small. Google already publishes the technical substrate (the VaultGemma report). The marginal cost of pre-registering one empirical protocol is a signed paragraph identifying the protocol, its threat model, and the lower-bound method.

Second, adoption is a downstream question. The contribution of this proposal is to specify what would constitute a binding disclosure. Whether any specific publisher adopts it is a separate matter, dependent on regulator attention, peer pressure, and downstream-deployer demand (the same demand that drove the OpenDP Registry to about 40 entries in its first year). The proposal succeeds intellectually if it specifies the right thing, even if broad adoption requires a forcing function.

What would falsify the proposal

This is the discipline check.

1. No available audit protocol can be both pre-registered and informative for the relevant operating points.

Two operating points need protocol coverage, and the available techniques cover them unevenly.

Pre-release (publisher binds before training). Steinke 2023’s one-run audit and Panda 2025’s LLM canary both apply, both require training-time inclusion-randomization of audit examples, and both produce empirical-ε lower bounds with stated statistical analyses. This operating point has mature pre-registerable protocols today.

External post-hoc (auditor has only the released artifact). Today, the only protocols that do not require training-time cooperation are discoverable extraction (memorization probes) and post-hoc membership inference on documents the auditor can prove were in the training corpus. The first is leakage-evidence, not a formal-audit-lower-bound. The second is weaker than canary insertion and confounded by distribution shift between member and non-member documents. The Cebere 2026 zero-run line attempts to correct for the confounding, but its assumptions are not yet mature enough to bind. The honest reading is that no formal-audit-lower-bound protocol exists for the pure external post-hoc setting on a from-scratch DP pretraining release.

This unevenness is itself informative about the proposal’s value. A VaultGemma receipt today, completed honestly, would say: probe surface = open weights; pre-registered protocols available for external auditing = discoverable extraction only, as leakage-evidence. That disclosure makes the gap legible rather than papering over it. Publishers who want external formal-audit-lower-bound coverage must either cooperate with a third-party auditor in a pre-release setting that supports canary insertion, or wait for the post-hoc-auditing literature (Cebere et al. and follow-ons) to produce protocols with stable confounding-correction assumptions.

The falsification test is therefore narrower than “run Panda on VaultGemma.” It is: complete a VaultGemma receipt against the proposal, confirm that the available pre-registerable protocols correctly describe the operating envelope, and confirm that what the receipt cannot claim is itself useful disclosure. That is feasible without training-time access. The pre-release operating point can be validated independently on a synthetic DP-fine-tuning case where the audit team controls the training run.

2. The receipt is so heavy that no publisher would adopt it even informally.

Counter-test: take the VaultGemma technical report and complete a receipt. If the gaps are minor (artifact digest, clipping norm), the receipt is tractable. If the gaps are structural (no contribution bound possible, no neighboring relation declared), the receipt exposes a real disclosure problem in the underlying release, which is the right outcome. The companion VaultGemma completion is the first pass at this test. Result: VaultGemma’s public materials yield a claim-readable receipt today, with named gaps. That is acceptable. The receipt’s purpose is to make the gaps nameable.

3. The pre-registration framing is rejected by the DP community on conceptual grounds.

This is the most important falsification path and the reason this document exists as a draft for community review. Reading and circulation come first. Code follows the conceptual frame, not the other way around.

Open questions

Several questions remain genuinely open.

Chain receipts. A DP-SGD generator → DP synthetic dataset → downstream model trained on synthetic data is a chain. Each stage has different receipt requirements. How does pre-registration apply at chain boundaries? The likely answer: each stage has its own receipt, each binds by hash to upstream, the verifier renders the chain. The composition assumptions become first-class fields. Pre-registered protocols at downstream stages may only test downstream behaviour, not the composed guarantee.
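The likely answer for chains can be sketched as a walk over stage receipts, checking that each one names the digest of the stage before it. The upstream_sha256 field, the stage labels, and the sorted-key JSON encoding are all assumptions for illustration.

```python
import hashlib
import json


def digest(receipt: dict) -> str:
    """SHA-256 over a deterministic JSON encoding (an assumed canonical form)."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def broken_links(chain: list[dict]) -> list[int]:
    """Return indices of stages whose upstream_sha256 does not match the prior stage."""
    return [
        i
        for i in range(1, len(chain))
        if chain[i].get("upstream_sha256") != digest(chain[i - 1])
    ]


generator = {"stage": "dp_sgd_generator"}
dataset = {"stage": "dp_synthetic_dataset", "upstream_sha256": digest(generator)}
model = {"stage": "downstream_model", "upstream_sha256": digest(dataset)}
print(broken_links([generator, dataset, model]))
```

A verified chain of hashes says only that the receipts refer to one another. It says nothing about whether the composed guarantee holds, which is why the composition assumptions need their own fields.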

Unbounded-corpus pretraining. VaultGemma has no meaningful contribution bound because the population is “the web.” Sequence-level DP is the publisher’s response. The receipt must allow this case without papering over it. A contribution_bound: null paired with a structured null_reason and a prominent user-mapping disclosure (user_mapping: not_user_level) is the candidate treatment. The verifier should make sequence-level vs. user-level legible to a non-expert.
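A sketch of the candidate treatment: a null contribution bound is accepted only when it comes with a reason and a user-mapping disclosure, and the rendered text states plainly that the guarantee is not user-level. The field names contribution_bound, null_reason, and user_mapping come from this proposal. The unbounded_corpus reason value, the section the fields live in, and the display wording are hypothetical.

```python
def contribution_disclosure(section: dict) -> str:
    """Render the contribution bound, refusing a bare null with no explanation."""
    bound = section.get("contribution_bound")
    if bound is not None:
        return f"contribution bound: {bound}"
    reason = section.get("null_reason")
    mapping = section.get("user_mapping")
    if not reason or not mapping:
        raise ValueError("contribution_bound is null: null_reason and user_mapping are required")
    label = "NOT user-level" if mapping == "not_user_level" else mapping
    return f"no contribution bound ({reason}); guarantee is {label}"


print(contribution_disclosure({
    "contribution_bound": None,
    "null_reason": "unbounded_corpus",  # hypothetical reason value
    "user_mapping": "not_user_level",
}))
```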

Multi-protocol coverage. Should the verifier compute a coverage score (“this receipt pre-registers protocols covering the open_weights surface but not the synthetic_dataset surface”)? The risk is that a coverage score becomes the new privacy score. Surfacing coverage as a list, not a number, is the candidate mitigation.
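Coverage as a list, not a number, could be rendered along these lines. The surface names come from this proposal. The applicable_surfaces field, the protocol identifier in the example, and the function coverage_list are hypothetical.

```python
def coverage_list(surfaces: list[str], protocols: list[dict]) -> list[str]:
    """One line per disclosed surface, naming its protocols or the lack of any."""
    lines = []
    for surface in surfaces:
        ids = sorted(
            p["protocol_id"]
            for p in protocols
            if surface in p.get("applicable_surfaces", [])
        )
        if ids:
            lines.append(f"{surface}: " + ", ".join(ids))
        else:
            lines.append(f"{surface}: no pre-registered protocol")
    return lines


protocols = [
    {"protocol_id": "discoverable-extraction", "applicable_surfaces": ["open_weights"]},
]
for line in coverage_list(["open_weights", "synthetic_dataset"], protocols):
    print(line)
```

The function deliberately returns no ratio or count, so there is nothing for a reader to mistake for a privacy score.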

Protocol versioning under research drift. When panda-iclr-2025-llm-canary ships v2 with a stronger canary, do pre-registrations against v1 remain binding? Proposed answer: yes. v1 commitments bind to v1. The publisher may re-register v2 against the same release. The registry of protocols is append-only.
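The append-only answer can be illustrated with a small in-memory registry keyed by protocol identifier and version. The class ProtocolRegistry and its methods are hypothetical; a real registry would also need signatures and persistent storage.

```python
class ProtocolRegistry:
    """In-memory sketch of an append-only protocol registry."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], dict] = {}

    def register(self, protocol_id: str, version: str, spec: dict) -> None:
        """Add a new (protocol, version) entry; existing entries are never replaced."""
        key = (protocol_id, version)
        if key in self._entries:
            raise ValueError(f"{protocol_id} {version} is already registered")
        self._entries[key] = dict(spec)

    def resolve(self, protocol_id: str, version: str) -> dict:
        """Look up the exact version a pre-registration committed to."""
        return self._entries[(protocol_id, version)]


registry = ProtocolRegistry()
registry.register("panda-iclr-2025-llm-canary", "v1", {"canary": "original"})
registry.register("panda-iclr-2025-llm-canary", "v2", {"canary": "stronger"})
# A commitment made against v1 still resolves to the v1 definition.
print(registry.resolve("panda-iclr-2025-llm-canary", "v1"))
```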

Single-org pre-registration audits. Can a publisher pre-register and run their own audit? Yes, but the probe receipt is then a publisher-side audit report, not an external attestation. The verifier displays the auditor’s identity. Readers judge independence.

Hosted-API binding. A hosted-API release whose backend model swaps silently breaks every probe receipt. The receipt’s subject.artifact_digest can be null with reason hosted_api, but probe_surface.version_pinning must specify how the auditor knows they are probing the same artifact (deployment id, API version header, publisher-signed assertion of stability). This is a usability question more than a definitional one.
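The hosted-API rule could be checked as follows. subject.artifact_digest, the hosted_api reason, and probe_surface.version_pinning come from this proposal. The artifact_digest_null_reason field, the method key inside version_pinning, and the three method labels are hypothetical encodings of the options listed above.

```python
# Hypothetical labels for the pinning options this proposal lists.
PINNING_METHODS = {
    "deployment_id",
    "api_version_header",
    "publisher_signed_stability_assertion",
}


def hosted_api_binding_errors(receipt: dict) -> list[str]:
    """Check that a receipt with no artifact digest still pins what is being probed."""
    errors = []
    subject = receipt.get("subject", {})
    if subject.get("artifact_digest") is not None:
        return errors  # A digest pins the artifact directly.
    if subject.get("artifact_digest_null_reason") != "hosted_api":
        errors.append("null artifact_digest needs reason hosted_api")
    pinning = receipt.get("probe_surface", {}).get("version_pinning") or {}
    if pinning.get("method") not in PINNING_METHODS:
        errors.append("probe_surface.version_pinning must name a pinning method")
    return errors


hosted = {
    "subject": {"artifact_digest": None, "artifact_digest_null_reason": "hosted_api"},
    "probe_surface": {"surface_type": "hosted_api"},
}
print(hosted_api_binding_errors(hosted))
```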

Sources