AudioToken: Adaptation of Text-Conditioned Diffusion Models for
Audio-to-Image Generation
- URL: http://arxiv.org/abs/2305.13050v1
- Date: Mon, 22 May 2023 14:02:44 GMT
- Authors: Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz
- Abstract summary: We propose a novel method utilizing latent diffusion models trained for text-to-image-generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations.
- Score: 89.63430567887718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, image generation has shown a great leap in performance,
where diffusion models play a central role. Although generating high-quality
images, such models are mainly conditioned on textual descriptions. This raises
the question: "how can we adapt such models to be conditioned on other
modalities?" In this paper, we propose a novel method utilizing latent
diffusion models trained for text-to-image-generation to generate images
conditioned on audio recordings. Using a pre-trained audio encoding model, the
proposed method encodes audio into a new token, which can be considered as an
adaptation layer between the audio and text representations. Such a modeling
paradigm requires a small number of trainable parameters, making the proposed
approach appealing for lightweight optimization. Results suggest the proposed
method is superior to the evaluated baseline methods, considering objective and
subjective metrics. Code and samples are available at:
https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
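The audio-to-token idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean pooling, the single linear projection, the feature dimensions, and the placeholder position are all assumptions made for illustration, and random arrays stand in for the outputs of a pre-trained audio encoder and a text encoder. The sketch only shows how a small trainable projection could turn audio features into one token that sits among the text-prompt embeddings.

```python
import numpy as np


def audio_to_token(audio_features, weight, bias):
    """Pool frame-level audio features and project them into text-embedding space.

    Assumption: mean pooling followed by one linear layer; the abstract does not
    specify the adaptation layer's architecture.
    """
    pooled = audio_features.mean(axis=0)   # shape: (d_audio,)
    return pooled @ weight + bias          # shape: (d_text,)


def insert_audio_token(prompt_embeddings, audio_token, position):
    """Return a copy of the prompt embeddings with the audio token at `position`."""
    conditioned = prompt_embeddings.copy()
    conditioned[position] = audio_token
    return conditioned


rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for the example.
d_audio, d_text, n_frames, n_tokens = 512, 768, 50, 8

# Stand-ins for frozen, pre-trained components.
audio_features = rng.standard_normal((n_frames, d_audio))    # audio encoder output
prompt_embeddings = rng.standard_normal((n_tokens, d_text))  # text encoder output

# The only trainable parameters in this sketch: the projection layer.
weight = rng.standard_normal((d_audio, d_text)) * 0.02
bias = np.zeros(d_text)

audio_token = audio_to_token(audio_features, weight, bias)
conditioned = insert_audio_token(prompt_embeddings, audio_token, position=5)

trainable = weight.size + bias.size
print(conditioned.shape, trainable)
```

In this sketch only `weight` and `bias` would be optimized while the encoders and the diffusion model stay frozen, which mirrors the abstract's point about a small number of trainable parameters.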
Related papers
- BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval [3.347768376390811]
We investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance.
Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism and a bifurcated design.
arXiv Detail & Related papers (2024-08-19T19:56:10Z)
- SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models [21.669044026456557]
We propose a method to enable audio-conditioning in large scale image diffusion models.
In addition to audio-conditioned image generation, our method can also be utilized in conjunction with diffusion-based editing methods.
arXiv Detail & Related papers (2024-05-01T21:43:57Z)
- AdaDiff: Adaptive Step Selection for Fast Diffusion [88.8198344514677]
We introduce AdaDiff, a framework designed to learn instance-specific step usage policies.
AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function.
Our approach achieves similar results in terms of visual quality compared to the baseline using a fixed 50 denoising steps.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
- TransFusion: Transcribing Speech with Multinomial Diffusion [20.165433724198937]
We propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features.
We demonstrate comparable performance to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark.
We also propose new techniques for effectively sampling and decoding multinomial diffusion models.
arXiv Detail & Related papers (2022-10-14T10:01:43Z)