Generating Realistic Images from In-the-wild Sounds
- URL: http://arxiv.org/abs/2309.02405v1
- Date: Tue, 5 Sep 2023 17:36:40 GMT
- Title: Generating Realistic Images from In-the-wild Sounds
- Authors: Taegyeong Lee, Jeonghun Kang, Hyeonyu Kim, Taehwan Kim
- Abstract summary: We propose a novel approach to generate images from in-the-wild sounds.
First, we convert sound into text using audio captioning.
Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound.
- Score: 2.531998650341267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representing wild sounds as images is an important but challenging task due
to the lack of paired datasets between sound and images and the significant
differences in the characteristics of these two modalities. Previous studies
have focused on generating images from sound in limited categories or music. In
this paper, we propose a novel approach to generate images from in-the-wild
sounds. First, we convert sound into text using audio captioning. Second, we
propose audio attention and sentence attention to represent the rich
characteristics of sound and visualize the sound. Lastly, we propose a direct
sound optimization with CLIPscore and AudioCLIP and generate images with a
diffusion-based model. Experiments show that our model generates high-quality
images from wild sounds and outperforms baselines in both quantitative and
qualitative evaluations on wild audio datasets.
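
To make the pipeline concrete, here is a minimal, self-contained sketch of the direct sound optimization step. Toy random projections stand in for CLIP and AudioCLIP, a placeholder decoder stands in for the diffusion model, and the loss weighting `lam` is an assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for the real encoders: the paper uses CLIP (image-text)
# and AudioCLIP (image-audio); here both map into a shared 32-d space
# purely for illustration.
image_encoder = torch.nn.Linear(64, 32)
caption_emb = F.normalize(torch.randn(32), dim=0)  # from the audio caption
audio_emb = F.normalize(torch.randn(32), dim=0)    # from AudioCLIP's audio tower

def generate(latent: torch.Tensor) -> torch.Tensor:
    """Placeholder for the diffusion decoder (latent -> image features)."""
    return torch.tanh(latent)

latent = torch.randn(64, requires_grad=True)  # latent being optimized
opt = torch.optim.Adam([latent], lr=0.05)
lam = 0.5  # assumed weighting between the two similarity terms

for step in range(200):
    img = generate(latent)
    z = F.normalize(image_encoder(img), dim=0)
    # Maximize a CLIPScore-style similarity to the caption plus an
    # AudioCLIP similarity to the raw audio.
    loss = -(z @ caption_emb) - lam * (z @ audio_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final combined similarity: {-loss.item():.3f}")
```

The point of the sketch: the latent receives gradient from both the caption-side and the audio-side similarity, so the generated image is pulled toward the raw sound as well as toward its caption.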
Related papers
- SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound [5.999777817331317]
We introduce SEE-2-SOUND, a zero-shot approach that decomposes the task into (1) identifying visual regions of interest; (2) locating these elements in 3D space; (3) generating mono-audio for each; and (4) integrating them into spatial audio.
Using our framework, we demonstrate compelling results for generating spatial audio for high-quality videos, images, and dynamic images from the internet, as well as media generated by learned approaches.
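
A hedged sketch of the four-stage decomposition, with stub functions standing in for the off-the-shelf models SEE-2-SOUND composes; none of the names below come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str
    position_3d: tuple  # (x, y, z), e.g. from monocular depth estimation

def find_regions(image):                 # stages 1 + 2: detect and place in 3D
    return [Region("dog", (0.4, 0.0, 2.0)), Region("car", (-1.2, 0.0, 5.0))]

def mono_audio_for(region):              # stage 3: per-region mono audio
    return [0.0] * 16                    # placeholder waveform

def spatialize(sources):                 # stage 4: pan/filter by 3D position,
    # then mix into a multi-channel spatial output; identity here for brevity.
    return [wav for wav, _ in sources]

regions = find_regions(image=None)
sources = [(mono_audio_for(r), r.position_3d) for r in regions]
spatial_audio = spatialize(sources)
print(f"{len(spatial_audio)} spatialized sources")
```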
arXiv Detail & Related papers (2024-06-06T22:55:01Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
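
As a rough illustration of how the three desiderata might be combined when scoring candidate captions, here is a toy weighted sum with stand-in scorers; the weights and helper functions are assumptions, not the paper's formulation.

```python
# Stand-in scorers for the three desiderata; real versions would be a
# language-model log-probability, an audio-text similarity (CLAP-style),
# and a learned audibility classifier.

def fluency(caption: str) -> float:
    return -0.1 * len(caption.split())   # shorter = cheaper, as a proxy

def faithfulness(caption: str, audio_emb) -> float:
    return 0.7                           # placeholder audio-text match score

def audibility(caption: str) -> float:
    # Does the text describe something one could actually hear?
    return 0.9 if "sound" in caption or "barking" in caption else 0.3

def caption_score(caption, audio_emb, w=(1.0, 1.0, 1.0)):
    return (w[0] * fluency(caption)
            + w[1] * faithfulness(caption, audio_emb)
            + w[2] * audibility(caption))

candidates = ["a dog barking in the distance", "a red wooden chair"]
best = max(candidates, key=lambda c: caption_score(c, audio_emb=None))
print(best)  # audibility pulls the ranking toward hearable descriptions
```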
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
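
A minimal sketch of the sound-token idea: a learned adapter maps an audio embedding into the T2I model's word-embedding space so it can be spliced into a prompt. The adapter, dimensions, and placeholder slot are illustrative assumptions.

```python
import torch

d_audio, d_token, seq = 512, 768, 6
adapter = torch.nn.Linear(d_audio, d_token)  # trained to align the two spaces

audio_emb = torch.randn(d_audio)             # from a pretrained audio encoder
sound_token = adapter(audio_emb)             # now behaves like a word embedding

prompt_embs = torch.randn(seq, d_token)      # embeddings of e.g. "a photo of <S>"
slot = 4                                     # position of the placeholder <S>
prompt_embs[slot] = sound_token              # inject: plug-and-play with the T2I model

print(prompt_embs.shape)  # sequence shape unchanged, now sound-conditioned
```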
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
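
A toy sketch of the training objective: a denoiser predicts the noise added to an audio representation, conditioned on a CLIP embedding of a video frame. At test time the frame embedding can be swapped for a CLIP text embedding, since the two live in a shared space. The corruption process, MLP denoiser, and all sizes are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

d_clip, d_mel = 512, 128
denoiser = torch.nn.Sequential(
    torch.nn.Linear(d_mel + d_clip + 1, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, d_mel),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

mel = torch.randn(8, d_mel)          # audio targets from unlabeled videos
frame_emb = torch.randn(8, d_clip)   # CLIP embeddings of the paired frames

t = torch.rand(8, 1)                 # diffusion "timestep" in [0, 1)
noise = torch.randn_like(mel)
noisy = (1 - t) * mel + t * noise    # simplified interpolation-style corruption

pred = denoiser(torch.cat([noisy, frame_emb, t], dim=1))
loss = F.mse_loss(pred, noise)       # standard noise-prediction loss
loss.backward()
opt.step()
print(f"loss: {loss.item():.3f}")
```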
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
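
A hedged sketch of the alignment step: a translator maps audio embeddings into the latent space of a frozen, pretrained image generator. The linear translator, stand-in generator, and MSE alignment loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d_audio, d_latent = 256, 128
translator = torch.nn.Linear(d_audio, d_latent)        # trained
generator = torch.nn.Linear(d_latent, 64 * 64).eval()  # frozen stand-in
for p in generator.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(translator.parameters(), lr=1e-3)

audio_emb = torch.randn(16, d_audio)       # from a pretrained audio encoder
visual_latent = torch.randn(16, d_latent)  # latents of the paired frames

pred = translator(audio_emb)
loss = F.mse_loss(pred, visual_latent)     # align audio to visual latents
loss.backward()
opt.step()

image = generator(translator(audio_emb[:1])).view(64, 64)
print(image.shape)
```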
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Learning Visual Styles from Audio-Visual Associations [21.022027778790978]
We present a method for learning visual styles from unlabeled audio-visual data.
Our model learns to manipulate the texture of a scene to match a sound.
We show that audio can be an intuitive representation for manipulating images.
arXiv Detail & Related papers (2022-05-10T17:57:07Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
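
One way to picture the task, as an illustrative sketch rather than the paper's architecture: predict a short room impulse response from the image and filter the dry source audio with it. The predictor, shapes, and lengths below are assumptions.

```python
import torch
import torch.nn.functional as F

d_img, rir_len = 512, 64
rir_predictor = torch.nn.Linear(d_img, rir_len)  # image features -> impulse response

img_feat = torch.randn(d_img)                    # encoding of the target room photo
source = torch.randn(1, 1, 16000)                # 1 s of dry source audio

rir = rir_predictor(img_feat).view(1, 1, rir_len)
# Filter the source with the predicted response so it "sounds like" the room.
matched = F.conv1d(source, rir, padding=rir_len - 1)
print(matched.shape)
```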
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Sound-Guided Semantic Image Manipulation [19.01823634838526]
We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enriches the variety of image modifications.
The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
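
A minimal sketch of the manipulation loop, with random stand-ins for the shared embedding space; the paper uses a CLIP-style image-text space extended to audio, while the encoder and optimizer settings here are assumptions.

```python
import torch
import torch.nn.functional as F

d_lat, d_emb = 128, 64
image_encoder = torch.nn.Linear(d_lat, d_emb)        # latent -> shared space
audio_emb = F.normalize(torch.randn(d_emb), dim=0)   # sound in the shared space

latent = torch.randn(d_lat, requires_grad=True)      # latent of the source image
opt = torch.optim.Adam([latent], lr=0.05)

for _ in range(100):
    z = F.normalize(image_encoder(latent), dim=0)
    loss = 1 - z @ audio_emb      # pull the edited image toward the sound
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final similarity: {(1 - loss).item():.3f}")
```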
arXiv Detail & Related papers (2021-11-30T13:30:12Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds produced outside the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.