Collaborative Watermarking for Adversarial Speech Synthesis
- URL: http://arxiv.org/abs/2309.15224v2
- Date: Tue, 2 Jan 2024 09:32:23 GMT
- Title: Collaborative Watermarking for Adversarial Speech Synthesis
- Authors: Lauri Juvela (Aalto University, Finland) and Xin Wang (National
Institute of Informatics, Japan)
- Abstract summary: We propose a collaborative training scheme for synthetic speech watermarking.
We show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in neural speech synthesis have brought us technology that is not
only close to human naturalness, but is also capable of instant voice cloning
with little data, and is highly accessible with pre-trained models available.
Naturally, the potential flood of generated content raises the need for
synthetic speech detection and watermarking. Recently, considerable research
effort in synthetic speech detection has been related to the Automatic Speaker
Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on
passive countermeasures. This paper takes a complementary view to generated
speech detection: a synthesis system should make an active effort to watermark
the generated speech in a way that aids detection by another machine, but
remains transparent to a human listener. We propose a collaborative training
scheme for synthetic speech watermarking and show that a HiFi-GAN neural
vocoder collaborating with the ASVspoof 2021 baseline countermeasure models
consistently improves detection performance over conventional classifier
training. Furthermore, we demonstrate how collaborative training can be paired
with augmentation strategies for added robustness against noise and
time-stretching. Finally, listening tests demonstrate that collaborative
training has little adverse effect on perceptual quality of vocoded speech.
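As a sketch of the scheme described above: a minimal PyTorch-style training step in which, unlike adversarial GAN training, the vocoder is rewarded when the countermeasure flags its output as synthetic. All names (`vocoder`, `detector`, the `lam` weight, the `augment` hook for noise/time-stretching) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def collaborative_step(vocoder, detector, mel, real_wav, opt_g, opt_d,
                       lam=0.1, augment=lambda w: w):
    """One joint update; `augment` can add noise/time-stretching for robustness."""
    fake_wav = vocoder(mel)                       # watermarked synthetic speech

    # Detector update: learn to separate real (label 0) from generated (label 1).
    opt_d.zero_grad()
    logit_r = detector(augment(real_wav))
    logit_f = detector(augment(fake_wav.detach()))
    loss_d = (F.binary_cross_entropy_with_logits(logit_r, torch.zeros_like(logit_r)) +
              F.binary_cross_entropy_with_logits(logit_f, torch.ones_like(logit_f)))
    loss_d.backward()
    opt_d.step()

    # Vocoder update: stay close to natural speech, yet *help* the detector
    # flag the output as synthetic -- collaboration rather than adversarial play.
    opt_g.zero_grad()
    recon = F.l1_loss(fake_wav, real_wav)         # stand-in for the HiFi-GAN losses
    collab = F.binary_cross_entropy_with_logits(
        detector(augment(fake_wav)), torch.ones_like(logit_f))
    (recon + lam * collab).backward()
    opt_g.step()
    return loss_d.item(), recon.item(), collab.item()
```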
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate end-of-utterance (EOU) events up to 300 ms before the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
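As a rough illustration of cross-attention fusion of acoustic and linguistic streams for EOU prediction (the layer sizes and module names below are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    """Acoustic frames attend over the linguistic tokens recognized so far."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)    # one EOU logit per acoustic frame

    def forward(self, acoustic, linguistic):
        # acoustic:   (B, T_frames, d_model) encoded audio frames (queries)
        # linguistic: (B, T_tokens, d_model) encoded partial-hypothesis tokens
        fused, _ = self.cross_attn(acoustic, linguistic, linguistic)
        return self.head(fused).squeeze(-1)  # (B, T_frames) EOU logits

logits = EOUPredictor()(torch.randn(2, 100, 256), torch.randn(2, 20, 256))
```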
- DeepSpeech models show Human-like Performance and Processing of Cochlear Implant Inputs [12.234206036041218]
We use the deep neural network (DNN) DeepSpeech2 as a paradigm to investigate how natural and cochlear implant-based inputs are processed over time.
We generate naturalistic and cochlear implant-like inputs from spoken sentences and test the similarity of model performance to human performance.
We find that dynamics over time in each layer are affected by context as well as input type.
arXiv Detail & Related papers (2024-07-30T04:32:27Z)
- Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech.
AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level.
AudioSeal achieves state-of-the-art robustness to real-life audio manipulations and imperceptibility, as measured by both automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
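The localization idea can be sketched as a per-sample supervision signal. The following is a minimal illustration, not the released AudioSeal implementation; `detector` is assumed to emit one logit per audio sample.

```python
import torch
import torch.nn.functional as F

def localization_loss(detector, wav, wm_mask):
    # wav:     (B, T) waveform, watermarked only on some spans
    # wm_mask: (B, T) 1.0 on watermarked samples, 0.0 elsewhere
    logits = detector(wav)        # assumed to return one logit per sample
    return F.binary_cross_entropy_with_logits(logits, wm_mask)

def detect_spans(detector, wav, thresh=0.5):
    """Boolean mask of samples flagged as watermarked (sample-level detection)."""
    with torch.no_grad():
        return torch.sigmoid(detector(wav)) > thresh
```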
- Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning [3.5032870024762386]
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involves fine-tuning a multi-speaker TTS model to work with child speech.
An objective assessment showed a significant correlation between real and synthetic child voices.
arXiv Detail & Related papers (2023-11-07T19:31:44Z)
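A generic PyTorch transfer-learning recipe of the kind the summary describes might look as follows; the checkpoint path, key handling, and learning rate are assumptions rather than the paper's setup.

```python
import torch

def prepare_finetune(model, ckpt_path="fastpitch_multispeaker.pt", lr=1e-5):
    """Load pretrained weights and return a low-LR optimizer for fine-tuning."""
    state = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    # Keep only tensors whose shapes still match; e.g. a speaker-embedding
    # table enlarged for new child speakers keeps its fresh initialisation.
    compatible = {k: v for k, v in state.items()
                  if k in own and v.shape == own[k].shape}
    model.load_state_dict(compatible, strict=False)
    # A low learning rate adapts to child speech without erasing the
    # pretrained multi-speaker acoustic knowledge.
    return torch.optim.AdamW(model.parameters(), lr=lr)
```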
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
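A minimal sketch of joint diffusion training on concatenated speech and gesture features, assuming a standard epsilon-prediction (DDPM-style) objective; Diff-TTSG's actual parameterisation may differ.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_step(denoiser, mel, gesture, alphas_cumprod, opt):
    # Concatenate per-frame speech and gesture features into one sequence,
    # so a single denoiser learns their joint distribution.
    x0 = torch.cat([mel, gesture], dim=-1)            # (B, T, D_mel + D_gesture)
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise    # forward (noising) process
    loss = F.mse_loss(denoiser(x_t, t), noise)        # predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```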
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, understood as variations in rhythm, pitch, or accent, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
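A minimal sketch of the two-branch design, assuming a frozen pretrained ASV embedder and a trainable prosody encoder (module names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    def __init__(self, spk_embedder, prosody_encoder, d_spk=192, d_pro=128):
        super().__init__()
        self.spk = spk_embedder        # pretrained ASV embedder, kept frozen
        self.pro = prosody_encoder     # learns rhythm/pitch/accent cues
        self.clf = nn.Sequential(nn.Linear(d_spk + d_pro, 128),
                                 nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, wav):
        with torch.no_grad():          # identity branch is not fine-tuned
            e_spk = self.spk(wav)      # (B, d_spk) speaker embedding
        e_pro = self.pro(wav)          # (B, d_pro) prosody embedding
        return self.clf(torch.cat([e_spk, e_pro], dim=-1))  # synthetic-speech logit
```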
- Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
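One way to realise this with off-the-shelf tools is to threshold embedding similarity against enrolled reference recordings of the claimed speaker; a hedged sketch, with the threshold value purely illustrative:

```python
import torch
import torch.nn.functional as F

def is_fake(embedder, test_wav, enrolled_wavs, thresh=0.6):
    # embedder: any speaker-verification model mapping a waveform to a vector
    test = F.normalize(embedder(test_wav), dim=-1)
    refs = F.normalize(torch.stack([embedder(w) for w in enrolled_wavs]), dim=-1)
    score = (refs @ test).max().item()   # best cosine similarity to any reference
    return score < thresh                # far from all references -> likely deepfake
```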
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence for each target speaker is predicted, the corresponding speech can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
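A schematic of the pipeline, with `symbol_predictor` and `synthesizer` as hypothetical stand-ins for the paper's components:

```python
def separate_by_resynthesis(mixture_wav, symbol_predictor, synthesizer):
    # symbol_predictor: mixture waveform -> list of per-speaker discrete
    #                   symbol sequences (e.g. learned acoustic-unit tokens)
    token_seqs = symbol_predictor(mixture_wav)
    # Each speaker is rebuilt from symbols alone, so interference from the
    # mixture can only surface as wrong symbols, never as residual noise.
    return [synthesizer(tokens) for tokens in token_seqs]
```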
- Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework [1.623136488969658]
Speech Neuroprostheses have the potential to enable communication for people with dysarthria or anarthria.
Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface.
arXiv Detail & Related papers (2021-11-02T09:43:21Z)
- Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to the non-parallel data problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.