TransFusion: Transcribing Speech with Multinomial Diffusion
- URL: http://arxiv.org/abs/2210.07677v1
- Date: Fri, 14 Oct 2022 10:01:43 GMT
- Title: TransFusion: Transcribing Speech with Multinomial Diffusion
- Authors: Matthew Baas, Kevin Eloff, Herman Kamper
- Abstract summary: We propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features.
We demonstrate comparable performance to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark.
We also propose new techniques for effectively sampling and decoding multinomial diffusion models.
- Score: 20.165433724198937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have shown exceptional scaling properties in the image
synthesis domain, and initial attempts have shown similar benefits for applying
diffusion to unconditional text synthesis. Denoising diffusion models attempt
to iteratively refine a sampled noise signal until it resembles a coherent
signal (such as an image or written sentence). In this work we aim to see
whether the benefits of diffusion models can also be realized for speech
recognition. To this end, we propose a new way to perform speech recognition
using a diffusion model conditioned on pretrained speech features.
Specifically, we propose TransFusion: a transcribing diffusion model which
iteratively denoises a random character sequence into coherent text
corresponding to the transcript of a conditioning utterance. We demonstrate
comparable performance to existing high-performing contrastive models on the
LibriSpeech speech recognition benchmark. To the best of our knowledge, we are
the first to apply denoising diffusion to speech recognition. We also propose
new techniques for effectively sampling and decoding multinomial diffusion
models. These are required because traditional methods of sampling from
acoustic models are not possible with our new discrete diffusion approach. Code
and trained models are available: https://github.com/RF5/transfusion-asr
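The iterative denoising procedure the abstract describes can be sketched as follows. This is a minimal illustration, not TransFusion's actual implementation: the vocabulary size, sequence length, step count, and the `dummy_denoiser` (which here returns random probabilities in place of the trained speech-conditioned network) are all placeholder assumptions.

```python
import numpy as np

VOCAB_SIZE = 29   # hypothetical: 26 letters plus space, apostrophe, blank
SEQ_LEN = 16      # hypothetical transcript length
NUM_STEPS = 50    # hypothetical number of diffusion steps

rng = np.random.default_rng(0)

def dummy_denoiser(x_t, t, speech_features):
    """Stand-in for the trained model: returns per-position categorical
    probabilities over the character vocabulary (random logits -> softmax)."""
    logits = rng.normal(size=(SEQ_LEN, VOCAB_SIZE))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def sample_transcript(speech_features):
    # Start from pure categorical noise: a uniformly random character sequence.
    x_t = rng.integers(0, VOCAB_SIZE, size=SEQ_LEN)
    for t in reversed(range(NUM_STEPS)):
        probs = dummy_denoiser(x_t, t, speech_features)
        if t == 0:
            # Final step: decode deterministically with the argmax.
            x_t = probs.argmax(axis=-1)
        else:
            # Resample each position from the predicted categorical distribution.
            x_t = np.array([rng.choice(VOCAB_SIZE, p=p) for p in probs])
    return x_t

transcript_ids = sample_transcript(speech_features=None)
```

With a trained denoiser, the conditioning speech features would pull each resampling step toward the correct transcript; the paper's proposed sampling and decoding techniques replace the naive per-position resampling shown here.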
Related papers
- Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audiovisual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning [36.4086473737433]
We propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion.
To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model.
In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network.
arXiv Detail & Related papers (2023-09-10T08:55:24Z)
- AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation [89.63430567887718]
We propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations.
arXiv Detail & Related papers (2023-05-22T14:02:44Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better vision-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z)
- DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
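TransFusion and DiffusionBERT above both diffuse over discrete tokens rather than continuous signals. As a rough illustration of what a forward corruption step over characters can look like, here is a minimal sketch of one multinomial noising step; the vocabulary and the uniform-replacement schedule are assumptions for illustration, not either paper's exact formulation.

```python
import random

# Hypothetical character vocabulary: lowercase letters plus space.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def corrupt(tokens, beta_t, rng):
    """One forward multinomial-diffusion step: each token is independently
    replaced by a uniformly random vocabulary symbol with probability beta_t."""
    return [rng.choice(VOCAB) if rng.random() < beta_t else tok for tok in tokens]

rng = random.Random(0)
x0 = list("hello world")
x_light = corrupt(x0, beta_t=0.2, rng=rng)   # lightly corrupted
x_noise = corrupt(x0, beta_t=1.0, rng=rng)   # pure categorical noise
```

Repeating such steps with an increasing noise schedule takes a transcript to uniform noise; the reverse (denoising) direction is what the models above learn.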
This list is automatically generated from the titles and abstracts of the papers in this site.