Align, Adapt and Inject: Sound-guided Unified Image Generation
- URL: http://arxiv.org/abs/2306.11504v1
- Date: Tue, 20 Jun 2023 12:50:49 GMT
- Title: Align, Adapt and Inject: Sound-guided Unified Image Generation
- Authors: Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao,
Ping Luo
- Abstract summary: We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
- Score: 50.34667929051005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided image generation has witnessed unprecedented progress due to the
development of diffusion models. Beyond text and image, sound is a vital
element within the sphere of human perception, offering vivid representations
and naturally coinciding with corresponding scenes. Taking advantage of sound
therefore presents a promising avenue for exploration within image generation
research. However, the relationship between audio and image supervision remains
significantly underdeveloped, and the scarcity of related, high-quality
datasets brings further obstacles. In this paper, we propose a unified
framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation,
editing, and stylization. In particular, our method adapts input sound into a
sound token, like an ordinary word, which can plug and play with existing
powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first
train a multi-modal encoder to align audio representation with the pre-trained
textual manifold and visual manifold, respectively. Then, we propose the audio
adapter to adapt audio representation into an audio token enriched with
specific semantics, which can be injected into a frozen T2I model flexibly. In
this way, we are able to extract the dynamic information of varied sounds,
while utilizing the formidable capability of existing T2I models to facilitate
sound-guided image generation, editing, and stylization in a convenient and
cost-effective manner. The experimental results confirm that our proposed AAI
outperforms other text- and sound-guided state-of-the-art methods, and our
aligned multi-modal encoder is also competitive with other approaches on the
audio-visual and audio-text retrieval tasks.
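To make the "adapt and inject" step concrete, below is a minimal PyTorch sketch of the general idea, not the authors' implementation: it assumes an aligned audio encoder already yields a global audio embedding, a small adapter MLP (hypothetical here) maps that embedding into the text-token embedding space of a frozen T2I model, and the resulting sound token is spliced into the prompt's token-embedding sequence at a placeholder position. All names, dimensions, and the placeholder mechanism are illustrative assumptions.

```python
# Sketch only: "adapt" an audio embedding into a pseudo text token and "inject"
# it into a frozen T2I conditioning sequence. Hypothetical shapes and modules.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps an aligned audio embedding to a single injectable 'sound token'."""
    def __init__(self, audio_dim: int, token_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # (B, audio_dim) -> (B, 1, token_dim)
        return self.net(audio_emb).unsqueeze(1)

def inject_sound_token(prompt_embs: torch.Tensor,
                       sound_token: torch.Tensor,
                       placeholder_idx: int) -> torch.Tensor:
    """Replace the placeholder position (e.g. the word '*' in "a photo of *")
    in the prompt's token-embedding sequence with the adapted sound token."""
    out = prompt_embs.clone()
    out[:, placeholder_idx, :] = sound_token[:, 0, :]
    return out

# Toy usage with random tensors standing in for real encoders:
adapter = AudioAdapter(audio_dim=512, token_dim=768)
audio_emb = torch.randn(1, 512)        # from the aligned audio encoder (assumed)
prompt_embs = torch.randn(1, 77, 768)  # token embeddings of the text prompt
cond = inject_sound_token(prompt_embs, adapter(audio_emb), placeholder_idx=4)
# `cond` would then condition the frozen T2I UNet through cross-attention;
# only the adapter (and optionally the audio encoder) would be trained.
```

Because the T2I backbone stays frozen and only a lightweight adapter is learned, this kind of setup is what lets a sound token "plug and play" with existing diffusion models.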
Related papers
- SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models [21.669044026456557]
We propose a method to enable audio-conditioning in large-scale image diffusion models.
In addition to audio-conditioned image generation, our method can also be utilized in conjunction with diffusion-based editing methods.
arXiv Detail & Related papers (2024-05-01T21:43:57Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation processes in the music and film industry.
We hypothesize that focusing on such aspects of audio generation can improve generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Generating Realistic Images from In-the-wild Sounds [2.531998650341267]
We propose a novel approach to generate images from in-the-wild sounds.
First, we convert sound into text using audio captioning.
Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound.
arXiv Detail & Related papers (2023-09-05T17:36:40Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Sound-Guided Semantic Image Manipulation [19.01823634838526]
We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enriches the variety of image modifications (a minimal alignment sketch in this spirit follows this list).
The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T13:30:12Z)
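The entry above, like the "Align" stage of AAI, depends on pulling audio embeddings into a pretrained image-text embedding space. The following is a minimal sketch of the standard symmetric InfoNCE objective commonly used for such alignment; it is not code from any of the papers listed, and the dimensions, batch size, and temperature are illustrative assumptions.

```python
# Sketch only: contrastive alignment of audio embeddings with frozen CLIP-style
# image/text embeddings, so matched pairs attract and in-batch pairs repel.
import torch
import torch.nn.functional as F

def info_nce(audio_emb: torch.Tensor, clip_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (audio, image-or-text) embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    c = F.normalize(clip_emb, dim=-1)
    logits = a @ c.t() / temperature                     # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: in practice `audio_emb` comes from the trainable audio encoder and
# `clip_emb` from a frozen pretrained image/text encoder.
audio_emb = torch.randn(8, 512, requires_grad=True)
loss = info_nce(audio_emb, torch.randn(8, 512))
loss.backward()
```

Once aligned this way, audio embeddings can be compared with, or substituted for, text and image embeddings in the shared space, which is what supports both the retrieval results mentioned above and sound-token conditioning.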
This list is automatically generated from the titles and abstracts of the papers on this site.