VaultGemma receipt completion under the pre-registration proposal

James Dreben · 2026-05-19 · Companion to the pre-registration proposal

Companion artifact to the pre-registration proposal.

Purpose

This document executes one of the falsification tests from the pre-registration proposal. The test: complete a real model-release receipt against the proposed schema for the field’s first live DP-LLM and report what the exercise reveals.

Two questions it answers.

Is the receipt schema tractable? Can a non-Google researcher produce a usable receipt instance from VaultGemma’s public materials? If the gaps are minor, the schema is tractable. If the gaps are structural, the schema either exposes a real disclosure problem in the release (acceptable) or asks for facts publishers never publish (a defect in the schema).
Is the pre_registered_protocols mechanism informative? With or without protocol pre-registration, does the receipt for VaultGemma produce useful disclosure about what an outside auditor can and cannot verify?

Short answer: yes to both. The receipt is claim-readable with named structural gaps. The pre-registered protocols section makes the post-hoc external-auditing operating envelope visible in a way the literature has not yet had a place for.

A machine-readable JSON instance of the completed receipt lives in the modelreceipt project’s working repository (private). This document is the human-readable commentary on the same content.

Source material

Sinha et al., “VaultGemma: A Differentially Private Gemma Model” (arXiv:2510.15001).
Google Research, “VaultGemma: The world’s most capable differentially private LLM” (blog post, 2025).
HuggingFace google/vaultgemma-1b model card.
Diwan, Wang, Alabi, “Extractable Memorization of Differentially Private Large Language Model” (TPDP 2026 workshop note).
Wang et al., “Rethinking the Security of DP-SGD: A Corrected Analysis of Differentially Private Machine Learning” (arXiv:2605.15648).

The completed receipt, as read

In YAML for readability. The canonical form is JSON in the project’s working repository.

profile: model-training-receipt-v0

subject:
  artifact_type: weights
  name: VaultGemma 1B
  version: huggingface main revision, unpinned
  artifact_uri: https://huggingface.co/google/vaultgemma-1b
  artifact_digest: null
  publisher: Google
  license: gemma
  model_family: Gemma
  parameter_count: 1B
  architecture_summary: 26-layer decoder-only transformer, MQA,
    1024-token attention span, d_model 1152, vocab 256128
  tokenizer: Gemma/Gemini SentencePiece with split digits, preserved
    whitespace, byte-level encodings, 256K entries
  null_reasons:
    artifact_digest: Technical report does not pin a HuggingFace
      revision. The model card lists weights at unpinned main
      revision. An auditor cannot bind probe results to immutable
      bytes from public materials alone.

claim:
  dp_definition: approximate_dp
  epsilon: 2.0
  delta: 1.1e-10
  privacy_unit:
    kind: sequence
    description: 1024 consecutive tokens drawn from heterogeneous
      sources
  neighboring_relation: zeroing-out adjacency
  claim_boundary: full_training
  protected_artifact: model_weights

data_boundary:
  data_source_description: same data mixture used for Gemma 2; 13T
    primarily English tokens from web documents, code, and science
    articles
  unit_construction: token sequences from heterogeneous sources;
    long documents split into multiple sequences; shorter documents
    may be packed into one sequence
  sequence_length: 1024
  packing_policy: shorter documents may be packed into one sequence
  splitting_policy: long documents are split into multiple sequences
  contribution_bound: null
  user_mapping: not_user_level
  repeated_document_policy: at worst a single document may be sampled
    up to seven times; most source datasets fewer than three times
  null_reasons:
    contribution_bound: Pretraining corpus is the public web. No
      meaningful per-user contribution bound is definable. The
      receipt's claim is sequence-level DP, not user-level DP, and
      the user_mapping field surfaces this explicitly.

mechanism:
  mechanism_type: dp_sgd
  clipping_norm: null
  noise_multiplier: 0.6143481
  sampling_model: truncated_poisson
  batch_size_semantics: expected batch size 517989
  sampling_rate: null
  steps: 100000
  epochs: null
  batch_handling: fixed-size batches via padding or truncation of
    Poisson-sampled batches; padding reported at under 2% of total
    batch size
  gradient_accumulation: 64 independent partial gradients per model
    update; each partial gradient receives calibrated Gaussian noise
    before averaging
  gradient_normalization: null
  implementation_library: JAX Privacy
  null_reasons:
    clipping_norm: Not extracted from public materials. Technical
      report describes vectorized per-example clipping but does not
      state the L2 norm bound. Absence prevents the accounting from
      being independently checkable.
    sampling_rate: Not directly stated. Derivable in principle from
      expected_batch_size / total_sequence_count, but
      total_sequence_count depends on packing and split policy
      applied to the 13T-token corpus.
    gradient_normalization: Not specified. Wang et al. (arXiv
      2605.15648, 2026) argue that common DP-SGD implementations
      mismatch their analyses on this exact field (expected vs.
      sampled batch-size normalization of noisy clipped sums). Until
      disclosed, accounting cannot be independently checked against
      the implementation.

accounting:
  accountant: privacy_loss_distribution
  accountant_library: google_dp_accounting
  accountant_version: null
  theorem_or_method: ABLQ under zeroing-out adjacency
  delta_rationale: sequence-level delta reported as 1.1e-10
  composition_scope: full_training
  subsampling_amplification_assumption: truncated_poisson
  rounding_or_reporting_policy: epsilon reported as <= 2.0
  null_reasons:
    accountant_version: google_dp_accounting library version not
      pinned in the technical report. PLD numerics can vary across
      library versions; independent reproduction requires a pinned
      version.

empirical_evaluation:
  protocol_name: vaultgemma-discoverable-extraction-v1
  threat_model: model prompted with a 50-token prefix sampled from a
    training document; auditor checks whether the model generates
    the corresponding 50-token suffix
  attacker_knowledge: access to 50-token prefixes from a uniform
    subsample of training documents
  query_budget: approximately 1M sampled training examples
  sample_construction: uniform subsample across the 13T-token
    pretraining corpora
  success_metric: exact or approximate generation of the
    corresponding 50-token suffix
  decision_threshold: exact match counts as exact memorization;
    continuation within 10% edit distance counts as approximate
    memorization
  result: no detectable memorization reported by the publisher under
    this protocol
  negative_control: null

pre_registered_protocols:
  - protocol_id: vaultgemma-discoverable-extraction-v1
    protocol_version: "1.0"
    applies_to_probe_surface: open_weights
    result_class: leakage-evidence
    threat_model: 50-token prefix to 50-token suffix continuation
      memorization probe
    attacker_knowledge: uniform-subsample 50-token prefixes from the
      pretraining corpora; assumes auditor has access to documents
      they can demonstrate were in the pretraining mixture
    sample_construction: uniform subsample across the 13T-token
      pretraining corpora; minimum 1M prefix/suffix pairs for a
      binding negative result
    query_budget: 1M model completions
    decision_threshold: exact-match memorization rate above 0, or
      approximate-match (10% edit distance) memorization rate above 0
    lower_bound_method: none. This protocol is a memorization probe,
      not a DP-auditing test. Output is interpreted as leakage
      evidence, not as an empirical lower bound on epsilon.
    excluded_post_processing:
      - safety_filters
      - system_prompts
      - retrieval_augmentation
    expected_publisher_result: no detectable memorization at exact or
      approximate match thresholds

missing_pre_registered_protocols:
  formal_audit_lower_bound_for_open_weights: No mature post-hoc
    external auditing protocol exists today that produces a formal
    lower bound on epsilon for a from-scratch DP pretraining
    release. Available techniques (Steinke 2023 one-run; Panda 2025
    LLM canary) require training-time controlled inclusion of audit
    examples and cannot be executed post-hoc by an external auditor.
    Cebere 2026 zero-run is research-track. Its confounding-
    correction assumptions are not yet stable enough to bind. This
    absence is an honest disclosure, not a defect of the receipt.
  adversarial_extraction: Publisher did not pre-register an
    adversarial extraction protocol. If they had, the Diwan-Wang-
    Alabi TPDP 2026 result (7.6% exact / 12.6% approximate
    memorization on 15k targeted Pile sequences) would be a binding
    falsification under that protocol. Absent pre-registration the
    disagreement remains structurally underdetermined.

probe_surface:
  surface_type: open_weights
  version_pinning: null
  rate_limits: not_applicable
  randomness_controls: local_inference_config_required
  logging_or_policy_constraints: not_applicable
  null_reasons:
    version_pinning: HuggingFace revision not pinned by publisher in
      the technical report. An auditor's probe result cannot bind to
      immutable bytes without a fixed revision hash.

signature: (omitted. This is a researcher-completed receipt, not a
  publisher-signed one. A publisher-signed instance would carry the
  canonical serialization, signature algorithm, verifying key,
  publisher key identity, and signature bytes.)

Section-by-section commentary

Subject

Mostly complete. The only structural gap is artifact_digest, and the gap is informative. VaultGemma is published at unpinned HuggingFace main revision, so an auditor literally cannot bind probe results to immutable bytes. This is not a publisher omission so much as a publisher choice. It is reported as such.

Claim

Complete. VaultGemma’s public story on this section is unusually careful. Privacy unit is named explicitly as “1024 consecutive tokens.” Neighboring relation is named as zeroing-out adjacency. The publisher itself surfaces that user-level DP would be preferable where users can be identified. This is the section the rest of the receipt orients around.

Data boundary

contribution_bound: null is the right answer for a from-scratch pretraining release on the public web. The null is structural. There is no per-user contribution bound to declare because the unit of protection is a sequence, not a user. The user_mapping: not_user_level field surfaces this without papering over it. A non-expert reader of the receipt should be able to see that “sequence-level DP at ε=2 on 1024-token windows” does not mean “person-level DP at ε=2.”

Mechanism

Two of three null fields here are load-bearing.

clipping_norm is required for the accounting to be independently checkable. Its absence in public materials means an outside reviewer who wants to reproduce the privacy calculation cannot do so. The technical report describes per-example clipping but does not state the norm bound. This is a publisher-side disclosure gap.

gradient_normalization is the field Wang et al. (arXiv 2605.15648) target as a load-bearing implementation detail. Their preprint argues that common DP-SGD implementations mismatch their analyses on whether noisy clipped-gradient sums are normalized by expected or sampled batch size. Until VaultGemma’s normalization convention is disclosed, the accounting cannot be independently checked against the implementation. If the Wang et al. critique holds, this field becomes the single most disclosure-critical implementation detail.

sampling_rate is derivable in principle but requires knowing the total sequence count after packing and splitting, which the report does not supply directly.

Accounting

Mostly complete with one informative gap: accountant_version for google_dp_accounting. PLD numerics vary across library versions. Independent reproduction of the accounting requires the pinned version. This is a small disclosure ask with a large reproducibility payoff.

Empirical evaluation

Complete in the sense that the protocol metadata is fully populated. The publisher’s “no detectable memorization” result lives here, and the protocol that produced it is named (vaultgemma-discoverable-extraction-v1). negative_control is null because the publisher did not report a non-private-counterpart memorization baseline under the same protocol. Without one, the strength of the negative result is hard to calibrate.

Pre-registered protocols (the proposal’s contribution)

This is the section that exists because of the pre-registration proposal. For VaultGemma, exactly one protocol is genuinely pre-registerable from the public materials: the discoverable- extraction protocol the publisher has already run. Its result_class is leakage-evidence, not formal-audit-lower-bound. This is the honest classification. Extraction tests are not DP audits.

The companion section, missing_pre_registered_protocols, exists to make the gap legible. Two gaps matter.

First, no formal-audit-lower-bound protocol for external post-hoc auditing of from-scratch DP pretraining. The available formal- auditing techniques (Steinke 2023 one-run; Panda 2025 LLM canary; Jagielski 2020 canary insertion) all require training-time cooperation. An outside auditor with only the released weights cannot run them. Cebere 2026 zero-run is research-track. This is not a defect of VaultGemma’s disclosure. It is the operating envelope of the field as it stands in May 2026.

Second, no pre-registered adversarial-extraction protocol. Google could have pre-registered, alongside their own discoverable- extraction protocol, the kind of adversarial protocol Diwan, Wang, Alabi used in their TPDP 2026 note. If they had, the 7.6% exact / 12.6% approximate memorization that paper reports would be a binding falsification of the “no detectable memorization” claim under the adversarial protocol. Without pre-registration, the disagreement is structurally underdetermined. Two parties ran different tests, neither bound to either party in advance, and there is no shared protocol to adjudicate.

This is the structural problem the pre-registration proposal exists to fix.

Probe surface

surface_type: open_weights is the right classification. The single null is version_pinning, which is the same problem subject.artifact_digest flags. VaultGemma is published at an unpinned HuggingFace revision. An auditor’s probe result cannot bind to immutable bytes without a fixed revision hash.

What the gaps reveal

Three things, in order of importance.

A schema-disclosure gap is not the same as a privacy-engineering gap. VaultGemma is an unusually careful DP release on every layer that matters technically. The privacy unit is named. The neighboring relation is named. The accounting is principled (ABLQ + PLD). The empirical evaluation has a named protocol. What it lacks is the publisher’s discipline of releasing the implementation details and pre-committing to the empirical protocols that should be considered binding. Both are missing. Both are easy to add. Neither requires changing how VaultGemma was actually trained.

Post-hoc external auditing of from-scratch DP pretraining is an open research problem the receipt makes legible. The missing_pre_registered_protocols.formal_audit_lower_bound_for_open_weights field is not a failure of the receipt. It is a structural fact about the operating envelope. The literature on DP-LLM auditing has not yet produced a protocol that an external auditor can run on a released DP-pretrained model and produce a binding lower bound on epsilon. The receipt should expose that absence rather than pretending one exists.

Pre-registration would have changed the VaultGemma vs. Diwan-Wang-Alabi disagreement from a dueling-press-release into a structurally legible exchange. The disagreement is real and important. Google reports no detectable memorization. An external workshop reports 7.6% exact memorization with a more adversarial protocol. Both could be right. They are testing different things. With pre-registration, the publisher could have signed two protocols at release time (the discoverable-extraction one and an adversarial- extraction one), and the workshop result would then be a binding falsification of the adversarial-protocol pre-registration. Without pre-registration, the workshop result is a fair criticism that the publisher can dismiss on protocol-choice grounds, and neither side has a shared standard.

What it would take for VaultGemma to reach `probe-ready` under the proposal

Six fields and one decision. The fields, in order of how much they unblock the accounting.

mechanism.clipping_norm. The L2 norm bound used for per-example clipping. Required to check the noise calibration.
mechanism.gradient_normalization. The denominator used for noisy clipped-gradient sums. Required to check accounting against implementation under the Wang et al. critique.
mechanism.sampling_rate. The per-step Poisson sampling rate. Derivable from expected_batch_size / total_sequence_count if the latter is published.
accounting.accountant_version. Pinned version of google_dp_accounting. Required to reproduce the PLD numerics.
subject.artifact_digest. Pinned HuggingFace revision hash. Required to bind probe results to immutable bytes.
probe_surface.version_pinning. The same revision hash, used by the verifier to bind probe receipts.

The decision: whether to pre-register an adversarial-extraction protocol alongside the discoverable-extraction one. If yes, an adversarial result like Diwan-Wang-Alabi’s becomes structurally adjudicable. If no, that disagreement stays as it is. Either is a defensible choice. The receipt should record it explicitly.

Does this validate the proposal?

The proposal’s first falsification test was: complete a real receipt against the schema for the first live DP-LLM release and confirm that the absence of a formal-audit-lower-bound protocol for external post-hoc auditing is itself useful disclosure.

Result: yes. The completed receipt produces useful disclosure in three independent ways.

It surfaces the six implementation-and-pinning fields VaultGemma’s technical report leaves unspecified, with structural reasons each matters.

It names the post-hoc external-auditing gap as a property of the field, not the publisher.

It reframes the VaultGemma vs. Diwan-Wang-Alabi dispute as a pre-registration failure rather than a methodological dispute.

The schema is tractable. The pre-registration mechanism is informative. The proposal’s first falsification test passes.

The pre-release operating point (Steinke one-run and Panda canary binding under publisher cooperation) is the second falsification test. It is feasible but requires running an actual DP-fine-tuning job with controlled audit-example inclusion. That is a follow-on exercise.