Protecting Your Voice: Temporal-aware Robust Watermarking
- URL: http://arxiv.org/abs/2504.14832v2
- Date: Sat, 21 Jun 2025 15:59:07 GMT
- Title: Protecting Your Voice: Temporal-aware Robust Watermarking
- Authors: Yue Li, Weizhi Liu, Dongdong Lin, Hui Tian, Hongxia Wang
- Abstract summary: We propose a temporal-aware robust watermarking (True) method. An integrated content-driven encoder is designed for watermarked waveform reconstruction. A temporal-aware gated convolutional network is meticulously designed to bit-wise recover the watermark.
- Score: 10.883912837253794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of generative models has led to the synthesis of real-fake ambiguous voices. To erase the ambiguity, embedding watermarks into the frequency-domain features of synthesized voices has become a common routine. However, the robustness achieved by choosing the frequency domain often comes at the expense of fine-grained voice features, leading to a loss of fidelity. To maximize the learning of time-domain features, enhancing fidelity while maintaining robustness, we pioneer a temporal-aware robust watermarking (True) method for protecting speech and singing voices. For this purpose, an integrated, structurally lightweight content-driven encoder is designed for watermarked waveform reconstruction. Additionally, a temporal-aware gated convolutional network is meticulously designed to bit-wise recover the watermark. Comprehensive experiments and comparisons with existing state-of-the-art methods demonstrate the superior fidelity and vigorous robustness of the proposed True, which achieves an average PESQ score of 4.63.
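The abstract's decoding component, a temporal-aware gated convolutional network that bit-wise recovers the watermark from the waveform, can be pictured with the minimal PyTorch sketch below. This is an illustrative assumption rather than the authors' implementation: the class names (GatedConv1d, TemporalGatedDecoder), layer counts, kernel sizes, and channel widths are all hypothetical.

```python
# Minimal sketch (not the authors' code) of a temporal-aware gated
# convolutional decoder that recovers an n-bit watermark from a raw waveform.
# All layer sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class GatedConv1d(nn.Module):
    """1-D gated convolution: features are modulated by a learned sigmoid gate."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 5, dilation: int = 1):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation  # keep the temporal length unchanged
        self.feat = nn.Conv1d(in_ch, out_ch, kernel, padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel, padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.feat(x)) * torch.sigmoid(self.gate(x))


class TemporalGatedDecoder(nn.Module):
    """Stack of dilated gated convolutions followed by a per-bit classifier."""

    def __init__(self, n_bits: int = 32, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            GatedConv1d(1, channels, dilation=1),
            GatedConv1d(channels, channels, dilation=2),
            GatedConv1d(channels, channels, dilation=4),  # growing temporal context
        )
        self.pool = nn.AdaptiveAvgPool1d(n_bits)   # one slot per watermark bit
        self.head = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> logits: (batch, n_bits)
        h = self.layers(wav)
        return self.head(self.pool(h)).squeeze(1)


if __name__ == "__main__":
    decoder = TemporalGatedDecoder(n_bits=32)
    logits = decoder(torch.randn(2, 1, 16000))  # 1 s of 16 kHz audio
    bits = (logits > 0).int()                   # hard bit decisions
    print(bits.shape)                           # torch.Size([2, 32])
```

The sigmoid gate lets each time step modulate how much temporal context it contributes, which is the usual motivation for gated convolutions in waveform models.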
Related papers
- Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z)
- More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles [58.941305935872265]
We show that strong watermarks significantly reduce the entropy of the token distribution. We propose a framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling.
arXiv Detail & Related papers (2026-02-12T10:18:16Z)
- VocBulwark: Towards Practical Generative Speech Watermarking via Additional-Parameter Injection [10.244226665349483]
VocBulwark is a framework that freezes generative model parameters to preserve perceptual quality. VocBulwark achieves high-capacity and high-fidelity watermarking, offering robust defense against complex practical scenarios.
arXiv Detail & Related papers (2026-01-30T04:51:50Z)
- WaveSeg: Enhancing Segmentation Precision via High-Frequency Prior and Mamba-Driven Spectrum Decomposition [61.3530659856013]
We propose a novel decoder architecture, WaveSeg, which jointly optimizes feature refinement in the spatial and wavelet domains. High-frequency components are first learned from input images as explicit priors to reinforce boundary details. Experiments on standard benchmarks demonstrate that WaveSeg, leveraging a wavelet-domain frequency prior with Mamba-based attention, consistently outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-24T01:41:31Z)
- AWARE: Audio Watermarking with Adversarial Resistance to Edits [0.0]
AWARE (Audio Watermarking with Adversarial Resistance to Edits) is an approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained via adversarial optimization in the time-frequency domain under a level-proportional budget. AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits (a minimal BER sketch appears after this list).
arXiv Detail & Related papers (2025-10-20T13:10:52Z)
- StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models [55.05404953041403]
We propose a novel framework that seamlessly integrates a binary watermark into the diffusion generation process. We show that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
arXiv Detail & Related papers (2025-09-22T16:35:19Z)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
- Video Signature: In-generation Watermarking for Latent Video Diffusion Models [19.648332041264474]
Video Signature (VID SIG) is an in-generation watermarking method for latent video diffusion models. We achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Experimental results show that VID SIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency.
arXiv Detail & Related papers (2025-05-31T17:43:54Z)
- TriniMark: A Robust Generative Speech Watermarking Method for Trinity-Level Attribution [3.1682080884953736]
We propose a generative speech watermarking method (TriniMark) for authenticating the generated content. We first design a structurally lightweight watermark encoder that embeds watermarks into the time-domain features of speech. A temporal-aware gated convolutional network is meticulously designed in the watermark decoder for bit-wise watermark recovery.
arXiv Detail & Related papers (2025-04-29T08:23:28Z)
- SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis [8.590034271906289]
Speech synthesis technology has brought great convenience, while the widespread use of realistic deepfake audio has triggered hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation. We propose a framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations into the original speech.
arXiv Detail & Related papers (2025-04-14T03:21:23Z)
- XAttnMark: Learning Robust Audio Watermarking with Cross-Attention [15.216472445154064]
This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges the gap by leveraging partial parameter sharing between the generator and the detector. We propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility.
arXiv Detail & Related papers (2025-02-06T17:15:08Z)
- CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions [0.5120567378386615]
We fine-tune the model to produce more verbatim speech transcriptions.
We employ several techniques to increase robustness against multiple speakers and background noise.
arXiv Detail & Related papers (2024-08-29T14:52:42Z)
- GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis [37.065509936285466]
This paper proposes the generative robust audio watermarking method (Groot).
In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously.
Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.
arXiv Detail & Related papers (2024-07-15T06:57:19Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech.
AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level.
AudioSeal achieves state-of-the-art performance in terms of robustness to real-life audio manipulations and imperceptibility based on automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
- WavMark: Watermarking for Audio Generation [70.65175179548208]
This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet.
The watermark is imperceptible to human senses and exhibits strong resilience against various attacks.
It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection.
arXiv Detail & Related papers (2023-08-24T13:17:35Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Histogram Layer Time Delay Neural Networks for Passive Sonar Classification [58.720142291102135]
A novel method combines a time delay neural network and histogram layer to incorporate statistical contexts for improved feature learning and underwater acoustic target classification.
The proposed method outperforms the baseline model, demonstrating the utility in incorporating statistical contexts for passive sonar target recognition.
arXiv Detail & Related papers (2023-07-25T19:47:26Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- FastPitch: Parallel Text-to-speech with Pitch Prediction [9.213700601337388]
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
arXiv Detail & Related papers (2020-06-11T23:23:58Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
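Several of the watermarking papers above (AWARE, WavMark, Groot, AudioSeal) report robustness as bit error rate (BER) or extraction accuracy after audio edits. The snippet below is a minimal, generic sketch of how such a measurement is typically set up; the embed/edit/decode steps are hypothetical placeholders and are not taken from any of the listed papers.

```python
# Minimal sketch (assumed, not from any listed paper) of measuring BER and
# extraction accuracy for an audio watermark after an edit.
import numpy as np

rng = np.random.default_rng(0)


def ber(true_bits: np.ndarray, decoded_bits: np.ndarray) -> float:
    """Fraction of watermark bits recovered incorrectly."""
    return float(np.mean(true_bits != decoded_bits))


# Ground-truth 32-bit payload.
payload = rng.integers(0, 2, size=32)

# Placeholder stand-ins for a real embed -> edit -> decode pipeline:
watermarked = rng.standard_normal(16000)                   # 1 s of audio (dummy)
edited = watermarked + 0.01 * rng.standard_normal(16000)   # additive-noise edit
decoded = payload.copy()
flip = rng.choice(32, size=2, replace=False)               # pretend 2 bits flipped
decoded[flip] = 1 - decoded[flip]

print(f"BER: {ber(payload, decoded):.3f}")                       # 0.062
print(f"Extraction accuracy: {1 - ber(payload, decoded):.1%}")   # 93.8%
```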
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.