Hidden in Plain Tokens

Simply Robust, Gradient-Free Watermark for Synthetic Audio (ICML 2026)

Abstract

As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments show that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we establish a new state of the art for token-level watermarks in multimedia, one that arises simply from the nature of discrete representation learning.

Method Overview

Token-level watermarking mechanism in the audio domain
Illustration of a token-level watermarking mechanism in the audio domain. During generation, the autoregressive model computes a probability distribution over the vocabulary at each time step. A logit bias is applied to a pseudorandomly chosen subset of tokens, encouraging their selection, and the resulting token sequence is synthesized into a waveform by the decoder D. For detection, the waveform is re-encoded by the encoder E to recover the token sequence. The detector then performs a statistical hypothesis test based on the number of biased tokens, returning a p-value: the probability of observing at least that many biased tokens in content generated without the watermark. A sufficiently low p-value is evidence for the alternative hypothesis, meaning that a watermark signal is present.
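The generation and detection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the biased subset is seeded from a hash of a secret key and the previous token (one common choice for token-level watermarks), the test is an exact one-sided binomial test, and the "model" is a toy stand-in that emits random logits with greedy decoding. The names `green_tokens`, `bias_logits`, `detect`, `generate`, and the parameters `gamma`, `delta`, and `key` are all hypothetical.

```python
import hashlib
import math
import random


def green_tokens(prev_token, vocab_size, gamma, key):
    """Pseudorandomly choose the token subset that receives the logit bias.

    Assumption: the subset is seeded by the secret key and the previous token.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits, prev_token, gamma, delta, key):
    """Add delta to the logits of the tokens selected for this step."""
    green = green_tokens(prev_token, len(logits), gamma, key)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def detect(tokens, vocab_size, gamma, key):
    """Count biased tokens and run a one-sided binomial test.

    Returns (hits, n, p_value), where p_value is the probability of seeing
    at least `hits` biased tokens among n if no watermark were present.
    """
    n = len(tokens) - 1  # the first token has no predecessor to seed from
    hits = sum(
        tokens[t] in green_tokens(tokens[t - 1], vocab_size, gamma, key)
        for t in range(1, len(tokens))
    )
    p0 = int(gamma * vocab_size) / vocab_size  # hit rate without a watermark
    p_value = sum(
        math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(hits, n + 1)
    )
    return hits, n, p_value


def generate(length, vocab_size, gamma, delta, key, seed=0):
    """Toy stand-in for the autoregressive model: random logits, greedy decoding."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    for _ in range(length):
        logits = [rng.uniform(0.0, 1.0) for _ in range(vocab_size)]
        biased = bias_logits(logits, tokens[-1], gamma, delta, key)
        tokens.append(max(range(vocab_size), key=biased.__getitem__))
    return tokens


watermarked = generate(length=100, vocab_size=64, gamma=0.5, delta=2.0, key="demo-key")
hits, n, p_value = detect(watermarked, vocab_size=64, gamma=0.5, key="demo-key")
print(hits, n, p_value)
```

In this toy setting the bias (2.0) exceeds the full range of the unbiased logits, so greedy decoding always picks a biased token and the p-value is extremely small. The sketch skips the decoder D and encoder E entirely; in the audio setting, the detector sees re-encoded tokens, and any token that changes during that round trip can turn a hit into a miss.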

Method overview for mitigating retokenization errors
Illustration of how our method captures and explicitly mitigates the retokenization errors. First, we use the encoder and decoder modules from the codec of interest to encode, decode, and re-encode a dataset of waveforms (top). We use the confusion counts between tokens as edge weights in a graph whose vertices correspond to tokens. Then, we perform community detection on that graph, effectively reducing the vocabulary size through a many-to-one mapping from tokens to clusters (bottom). Singleton vertices correspond to tokens that have never been confused in the dataset. Note that we only require black-box access to the codec components.
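The pipeline in this caption can be illustrated with a small, self-contained Python sketch. It assumes the token sequences before and after the decode and re-encode round trip are already available and aligned position by position. As a deliberately simple stand-in for community detection, it merges tokens that lie in the same connected component of the confusion graph (keeping only edges with weight at least `min_weight`); the community detection algorithm actually used is not specified here, and a modularity-based method would generally produce finer clusters. The function names and the toy data are hypothetical.

```python
from collections import Counter


def confusion_counts(original, reencoded):
    """Count how often a token comes back as a different token after re-encoding."""
    counts = Counter()
    for a, b in zip(original, reencoded):
        if a != b:
            counts[frozenset((a, b))] += 1  # undirected edge weight
    return counts


def cluster_tokens(vocab_size, counts, min_weight=1):
    """Map each token to a cluster id (many-to-one).

    Stand-in for community detection: connected components over edges
    whose confusion count is at least min_weight, via union-find.
    """
    parent = list(range(vocab_size))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge, weight in counts.items():
        if weight >= min_weight:
            a, b = tuple(edge)
            parent[find(a)] = find(b)

    cluster_ids = {}
    return [cluster_ids.setdefault(find(t), len(cluster_ids)) for t in range(vocab_size)]


# Toy data: tokens 0/1 and 2/3 get confused by the codec; 4 and 5 never do.
original = [0, 1, 2, 3, 4, 5, 0, 1, 2]
reencoded = [1, 1, 2, 2, 4, 5, 0, 0, 3]

counts = confusion_counts(original, reencoded)
mapping = cluster_tokens(vocab_size=6, counts=counts)
print(mapping)
print([mapping[t] for t in original] == [mapping[t] for t in reencoded])
```

On this toy data, tokens 4 and 5 remain singletons because they are never confused, and the original and re-encoded sequences become identical once mapped to clusters. That is the property a watermark defined over clusters rather than raw tokens would rely on. Only token sequences are needed, which matches the black-box requirement stated above.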

Audio Samples

Qualitative examples from the main experiments, grouped by model. Each row uses the same sample prompt across methods. The samples are not cherry-picked.

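Moshi
Samples 000, 001, 002, each in four versions: None, Base, WMAR, Ours. (Audio players not shown.)
MusicGen
Samples 000, 001, 002, each in four versions: None, Base, WMAR, Ours. (Audio players not shown.)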

BibTeX

@inproceedings{milis2026hidden,
  title={Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio},
  author={Milis, Georgios and Qin, Yubin and Wu, Yihan and Huang, Heng},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}