Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
- URL: http://arxiv.org/abs/2501.05413v1
- Date: Thu, 09 Jan 2025 18:13:57 GMT
- Title: Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
- Authors: Darius Petermann, Mahdi M. Kalayeh
- Abstract summary: Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned.
We propose a scalable image sonification framework where instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired.
To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art.
- Score: 6.169364905804677
- License:
- Abstract: Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground-truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions in the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. That is, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that our model has implicitly developed to guide the image generation process.
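As a rough illustration of the retrieval idea in the abstract, the sketch below pairs an image with audio clips by comparing a caption (such as a vision-language model might produce) against the text descriptions of an audio-only corpus. The embedding function, corpus, and caption are toy placeholders, not the authors' actual pipeline or models.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder text embedding: hashes the string into a fixed random vector.
    In a real pipeline this would be a VLM caption encoder and an audio-text
    encoder (e.g., a CLAP-style model) sharing one embedding space."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def sonify_image(image_caption: str, audio_corpus: dict, k: int = 2) -> list:
    """Retrieve the k audio clips whose descriptions best match the caption."""
    q = toy_embed(image_caption)
    scored = sorted(
        ((float(q @ toy_embed(desc)), clip_id) for clip_id, desc in audio_corpus.items()),
        reverse=True,
    )
    return [clip_id for _, clip_id in scored[:k]]

# Toy audio-only corpus: clip id -> description of its content.
corpus = {
    "clip_001": "waves crashing on a rocky shore",
    "clip_002": "a dog barking in a park",
    "clip_003": "rain falling on a tin roof",
}
# With random placeholder embeddings the ranking is arbitrary; the point is the
# retrieval structure: caption in, ranked audio clips out.
print(sonify_image("a golden retriever playing fetch on the beach", corpus))
```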
Related papers
- Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment [18.08290178587821]
We propose a method for generating images of visual scenes from diverse in-the-wild sounds.
This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals.
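Latent alignment of this kind is commonly trained with a contrastive objective that pulls paired audio and image embeddings together and pushes mismatched pairs apart. The snippet below shows that generic objective (a symmetric InfoNCE loss over dummy embeddings) as background illustration; it is not necessarily the training recipe used by Sound2Vision.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(audio_z: torch.Tensor, image_z: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Generic contrastive loss aligning audio and image latents (sketch)."""
    a = F.normalize(audio_z, dim=-1)
    v = F.normalize(image_z, dim=-1)
    logits = a @ v.T / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0))            # i-th audio pairs with i-th image
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512))
```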
arXiv Detail & Related papers (2024-12-09T05:04:50Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core capability for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
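Mechanically, cross-image attention can be pictured as an attention step in which the queries come from the image being generated while the keys and values come from a reference image's features, letting appearance flow between the two denoising passes. The single-head function below illustrates that idea in isolation; the missing projections, multi-head structure, and injection points are simplifications rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cross_image_attention(target_feats: torch.Tensor,
                          reference_feats: torch.Tensor) -> torch.Tensor:
    """Single-head attention where queries come from the target image's tokens
    and keys/values from a reference image's tokens (simplified sketch)."""
    scale = target_feats.shape[-1] ** -0.5
    attn = F.softmax(target_feats @ reference_feats.T * scale, dim=-1)
    return attn @ reference_feats   # target tokens re-expressed with reference content

structure_tokens = torch.randn(256, 64)   # spatial tokens of the structure image
appearance_tokens = torch.randn(256, 64)  # spatial tokens of the appearance image
out = cross_image_attention(structure_tokens, appearance_tokens)
```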
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be plugged into existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text- and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation [89.63430567887718]
We propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations.
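The "audio as a token" mechanism can be sketched as a small adapter that projects a pooled audio embedding into the dimensionality of the text encoder's token embeddings, so the result can be spliced into the prompt sequence of a frozen text-to-image model (the same plug-and-play idea as the sound token in AAI above). The dimensions, pooling, and injection slot below are illustrative assumptions, not the exact AudioToken architecture.

```python
import torch
import torch.nn as nn

class AudioTokenAdapter(nn.Module):
    """Maps a sequence of audio features to one pseudo-token that lives in the
    text encoder's embedding space (illustrative sketch)."""
    def __init__(self, audio_dim: int = 512, text_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from a pretrained audio encoder.
        pooled = audio_feats.mean(dim=1)      # temporal mean pooling
        return self.proj(pooled)              # (batch, text_dim)

adapter = AudioTokenAdapter()
audio_feats = torch.randn(1, 250, 512)       # dummy audio-encoder output
prompt_embeds = torch.randn(1, 77, 768)      # dummy text-token embeddings
audio_token = adapter(audio_feats).unsqueeze(1)
# Splice the pseudo-token into the prompt at a placeholder position before
# feeding the sequence to the frozen T2I model.
conditioned = torch.cat([prompt_embeds[:, :5], audio_token, prompt_embeds[:, 6:]], dim=1)
```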
arXiv Detail & Related papers (2023-05-22T14:02:44Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech.
We also introduce pose modelling in speech2latent for pose controllability.
Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- Hypernetworks build Implicit Neural Representations of Sounds [18.28957270390735]
Implicit Neural Representations (INRs) are now used to represent multimedia signals in a range of real-world applications, including image super-resolution, image compression, and 3D rendering.
Existing methods that leverage INRs focus predominantly on visual data; their application to other modalities, such as audio, is nontrivial due to the inductive biases built into the architectures of image-based INR models.
We introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training.
Our approach reconstructs audio samples with quality comparable to other state-of-the-art methods.
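The central mechanism here, a hypernetwork that emits the weights of an audio INR, can be illustrated with a toy example: a conditioning vector describing a sound is mapped to the parameters of a one-hidden-layer sine MLP, which is then evaluated at arbitrary time coordinates. The sizes and activations are illustrative assumptions, not the HyperSound architecture.

```python
import torch
import torch.nn as nn

class TinyHyperINR(nn.Module):
    """Hypernetwork predicting the weights of a 1-hidden-layer INR that maps
    time t -> amplitude (toy sketch, not HyperSound itself)."""
    def __init__(self, cond_dim: int = 128, hidden: int = 32):
        super().__init__()
        self.hidden = hidden
        # Parameters of the target INR: W1 (hidden x 1), b1, W2 (1 x hidden), b2.
        n_params = hidden + hidden + hidden + 1
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 256), nn.ReLU(), nn.Linear(256, n_params)
        )

    def forward(self, cond: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # cond: (cond_dim,) audio descriptor; t: (n,) time coordinates in [0, 1].
        p = self.hyper(cond)
        h = self.hidden
        w1, b1 = p[:h].view(h, 1), p[h:2 * h]
        w2, b2 = p[2 * h:3 * h].view(1, h), p[3 * h]
        hidden_act = torch.sin(t.unsqueeze(-1) @ w1.T + b1)   # sine INR activation
        return (hidden_act @ w2.T).squeeze(-1) + b2            # predicted amplitude

model = TinyHyperINR()
waveform = model(torch.randn(128), torch.linspace(0.0, 1.0, 16000))
```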
arXiv Detail & Related papers (2023-02-09T22:24:26Z)
- Audio-to-Image Cross-Modal Generation [0.0]
Cross-modal representation learning makes it possible to integrate information from different modalities into a single representation.
We train variational autoencoders (VAEs) to reconstruct image archetypes from audio data.
Our results suggest that even when the generated images are relatively inconsistent (i.e., diverse), the features that are critical for correct image classification are preserved.
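A minimal version of this setup is an encoder-decoder with a variational bottleneck, where the encoder consumes an audio embedding and the decoder emits an image. The sketch below shows that generic structure; the dimensions, layers, and Gaussian reparameterization are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class AudioToImageVAE(nn.Module):
    """Toy audio-conditioned VAE: audio embedding -> latent -> small image
    (illustrative sketch only)."""
    def __init__(self, audio_dim: int = 128, latent_dim: int = 32, img_size: int = 32):
        super().__init__()
        self.img_size = img_size
        self.enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * img_size * img_size), nn.Sigmoid(),
        )

    def forward(self, audio_emb: torch.Tensor):
        h = self.enc(audio_emb)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        img = self.dec(z).view(-1, 3, self.img_size, self.img_size)
        return img, mu, logvar

model = AudioToImageVAE()
img, mu, logvar = model(torch.randn(4, 128))   # batch of 4 dummy audio embeddings
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL term of the ELBO
```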
arXiv Detail & Related papers (2021-09-27T21:25:31Z)
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that exploits finer-grained visual information from different parts of the image by using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
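One generic way to use proposal-level features is to let the recognizer attend over a set of region vectors rather than a single global image vector. The snippet below shows that attention step in isolation; the proposal extractor and the speech recognizer it would plug into are assumptions, not specified by the summary.

```python
import torch
import torch.nn.functional as F

def attend_over_regions(decoder_state: torch.Tensor,
                        region_feats: torch.Tensor) -> torch.Tensor:
    """Soft attention over object-proposal features (generic sketch).
    decoder_state: (d,) current state of the recognizer's decoder.
    region_feats: (n_regions, d) features pooled from automatic object proposals."""
    scores = region_feats @ decoder_state     # (n_regions,)
    weights = F.softmax(scores, dim=0)
    return weights @ region_feats             # attended visual context, shape (d,)

context = attend_over_regions(torch.randn(512), torch.randn(36, 512))
```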
arXiv Detail & Related papers (2020-10-05T23:06:24Z)