Sound-Guided Semantic Image Manipulation
- URL: http://arxiv.org/abs/2112.00007v1
- Date: Tue, 30 Nov 2021 13:30:12 GMT
- Title: Sound-Guided Semantic Image Manipulation
- Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young
Kim, Jinkyu Kim, Sangpil Kim
- Abstract summary: We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enriches the variety of image modifications.
Experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other state-of-the-art text- and sound-guided methods.
- Score: 19.01823634838526
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recent success of generative models shows that leveraging the
multi-modal embedding space makes it possible to manipulate an image using text
information. However, manipulating an image with sources other than text, such
as sound, is not easy due to the dynamic characteristics of those sources. In
particular, sound can convey vivid emotions and dynamic expressions of the real
world. Here, we propose a framework that directly encodes sound into the
multi-modal (image-text) embedding space and manipulates an image from that
space. Our audio encoder is trained to produce a latent representation from an
audio input, which is forced to be aligned with the image and text
representations in the multi-modal embedding space. We use a direct latent
optimization method based on the aligned embeddings for sound-guided image
manipulation. We also show that our method can mix text and audio modalities,
which enriches the variety of image modifications, and we verify the
effectiveness of our sound-guided image manipulation quantitatively and
qualitatively. Experiments on zero-shot audio classification and semantic-level
image classification show that our proposed model outperforms other
state-of-the-art text- and sound-guided methods.
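The manipulation step described in the abstract is a CLIP-style direct latent optimization in which the target embedding comes from sound rather than text. Below is a minimal PyTorch sketch of that idea, assuming a StyleGAN-like `generator`, a joint-space image encoder `clip_image`, and an `audio_encoder` already aligned to the same space; these names are hypothetical and the paper's actual losses and regularizers are not reproduced here.

```python
# Minimal, illustrative sketch of sound-guided direct latent optimization.
# Assumptions (hypothetical names, not the authors' code): `generator` maps a
# latent code w to an image, `clip_image` embeds images into the joint
# image-text space, and `audio_encoder` embeds a waveform into that same space.
import torch
import torch.nn.functional as F

def manipulate_with_sound(generator, clip_image, audio_encoder, w_init, waveform,
                          text_emb=None, alpha=1.0, steps=200, lr=0.05, lambda_reg=0.01):
    """Optimize a latent code so the generated image matches the sound embedding."""
    target = F.normalize(audio_encoder(waveform), dim=-1)        # sound -> joint space
    if text_emb is not None:                                      # optional modality mixing
        target = F.normalize(alpha * target + (1.0 - alpha) * text_emb, dim=-1)
    w = w_init.detach().clone().requires_grad_(True)              # start from source latent
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                                        # synthesize an image
        img_emb = F.normalize(clip_image(img), dim=-1)
        sim_loss = 1.0 - (img_emb * target).sum(dim=-1).mean()    # cosine distance to target
        reg_loss = lambda_reg * (w - w_init).pow(2).mean()        # stay close to the source
        loss = sim_loss + reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

The optional blending of an audio and a text embedding before optimization corresponds to the text-and-audio mixing mentioned in the abstract; the weight `alpha` is an illustrative knob, not a parameter taken from the paper.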
Related papers
- An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment [6.977241620071544]
Multimodal large language models have fueled progress in image captioning.
In this work, we show that this ability can be re-purposed for audio captioning.
We introduce a novel methodology for bridging the audiovisual modality gap.
arXiv Detail & Related papers (2024-10-08T12:52:48Z)
- SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models [21.669044026456557]
We propose a method to enable audio-conditioning in large scale image diffusion models.
In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods.
arXiv Detail & Related papers (2024-05-01T21:43:57Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the match between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Can CLIP Help Sound Source Localization? [19.370071553914954]
We introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder.
By directly using these embeddings, our method generates audio-grounded masks for the provided audio.
Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects.
arXiv Detail & Related papers (2023-11-07T15:26:57Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be plugged into existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation [89.63430567887718]
We propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations (a schematic sketch of this token-adaptation idea follows the related-papers list).
arXiv Detail & Related papers (2023-05-22T14:02:44Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Robust Sound-Guided Image Manipulation [17.672008998994816]
We propose a novel approach that first extends the image-text joint embedding space with sound.
Our experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results.
arXiv Detail & Related papers (2022-08-30T09:59:40Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
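Several entries above (AudioToken, Align-Adapt-Inject, SonicDiffusion) share the idea of mapping a sound into a pseudo-token that a frozen text-to-image model consumes like an ordinary word. Below is a minimal, hypothetical sketch of such an adaptation layer; the module structure and dimensions are illustrative and do not reproduce any of those papers' code.

```python
# Hypothetical audio-to-token adapter for a frozen text-to-image model.
import torch
import torch.nn as nn

class AudioToToken(nn.Module):
    """Project a pretrained audio embedding into the text-token space of a
    frozen text-to-image model, so the sound can be inserted into a prompt
    as if it were a single word."""
    def __init__(self, audio_dim=768, token_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_emb):
        # audio_emb: (batch, audio_dim) from a frozen, pretrained audio encoder
        return self.proj(audio_emb)  # (batch, token_dim), used as one prompt token
```

In these works such an adapter is typically trained with the diffusion model's own denoising objective while the text-to-image backbone stays frozen.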
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.