Related papers: GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

URL: http://arxiv.org/abs/2407.10471v2
Date: Wed, 17 Jul 2024 05:43:36 GMT
Title: GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li,
Abstract summary: This paper proposes the generative robust audio watermarking method (Groot) In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously. Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.
Score: 37.065509936285466
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.

Related papers

Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression.<n>Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space.<n>Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z)
SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion [11.934813439152528]
Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and plays a crucial role in AI safety.<n>Existing in-generation approaches are non-blind, requiring maintaining all the message-key pairs and performing template-based matching during extraction.<n>We propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion.
arXiv Detail & Related papers (2026-03-03T11:33:44Z)
VocBulwark: Towards Practical Generative Speech Watermarking via Additional-Parameter Injection [10.244226665349483]
VocBulwark is a framework that freezes generative model parameters to preserve perceptual quality.<n>VocBulwark achieves high-capacity and high-fidelity watermarking, offering robust defense against complex practical scenarios.
arXiv Detail & Related papers (2026-01-30T04:51:50Z)
EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model [56.53617289548353]
EchoGen is a pioneering framework that empowers Visual Auto-Regressive ( VAR) models with subject-driven generation capabilities.<n>We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition.<n>To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models.
arXiv Detail & Related papers (2025-09-30T11:45:48Z)
Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models [50.73220224678009]
Watermarking can be used to verify the origin of synthetic images generated by artificial intelligence models.<n>Recent studies demonstrate the capability to forge watermarks from a target image onto cover images via adversarial techniques.<n>In this paper, we uncover a greater risk of an optimization-free and universal watermark forgery.<n>Our approach significantly broadens the scope of attacks, presenting a greater challenge to the security of current watermarking techniques.
arXiv Detail & Related papers (2025-06-06T12:08:02Z)
TriniMark: A Robust Generative Speech Watermarking Method for Trinity-Level Attribution [3.1682080884953736]
We propose a generative textbfspeech wattextbfermarking method (TriniMark) for authenticating the generated content. We first design a structure-lightweight watermark encoder that embeds watermarks into the time-domain features of speech. A temporal-aware gated convolutional network is meticulously designed in the watermark decoder for bit-wise watermark recovery.
arXiv Detail & Related papers (2025-04-29T08:23:28Z)
SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation [3.1682080884953736]
We propose generative watermarking method that integrates parameter-efficient fine-tuning with speech watermarking. The proposed method ensures high-fidelity watermarked speech even at a large capacity of 2000 bps. It surpasses other state-of-the-art methods by nearly 23% in resisting time-stretching attacks.
arXiv Detail & Related papers (2025-04-21T11:43:36Z)
Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models [66.54457339638004]
Copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. We propose a diffusion model watermarking method tailored for real-world deployment. Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness.
arXiv Detail & Related papers (2025-04-21T11:18:16Z)
Revealing the Implicit Noise-based Imprint of Generative Models [71.94916898756684]
This paper presents a novel framework that leverages noise-based model-specific imprint for the detection task. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
arXiv Detail & Related papers (2025-03-12T12:04:53Z)
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention [15.216472445154064]
Cross-Attention Robust Audio Watermark (XAttnMark) This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges the gap by leveraging partial parameter sharing between the generator and the detector. We propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility.
arXiv Detail & Related papers (2025-02-06T17:15:08Z)
Towards Effective User Attribution for Latent Diffusion Models via Watermark-Informed Blending [54.26862913139299]
We introduce a novel framework Towards Effective user Attribution for latent diffusion models via Watermark-Informed Blending (TEAWIB) TEAWIB incorporates a unique ready-to-use configuration approach that allows seamless integration of user-specific watermarks into generative models. Experiments validate the effectiveness of TEAWIB, showcasing the state-of-the-art performance in perceptual quality and attribution accuracy.
arXiv Detail & Related papers (2024-09-17T07:52:09Z)
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN. We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling. It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences. It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
Invisible Watermarking for Audio Generation Diffusion Models [11.901028740065662]
This paper presents the first watermarking technique applied to audio diffusion models trained on mel-spectrograms. Our model excels not only in benign audio generation, but also incorporates an invisible watermarking trigger mechanism for model verification.
arXiv Detail & Related papers (2023-09-22T20:10:46Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
arXiv Detail & Related papers (2022-10-28T07:52:30Z)
A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals. The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-base generative model for unconditional raw audio synthesis. Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are commonly indistinguishable from real to human eyes. We propose a novel fake detection that is designed to re-synthesize testing images and extract visual cues for detection. We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.