Learning Visual Styles from Audio-Visual Associations
- URL: http://arxiv.org/abs/2205.05072v1
- Date: Tue, 10 May 2022 17:57:07 GMT
- Title: Learning Visual Styles from Audio-Visual Associations
- Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
- Abstract summary: We present a method for learning visual styles from unlabeled audio-visual data.
Our model learns to manipulate the texture of a scene to match a sound.
We show that audio can be an intuitive representation for manipulating images.
- Score: 21.022027778790978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From the patter of rain to the crunch of snow, the sounds we hear often
convey the visual textures that appear within a scene. In this paper, we
present a method for learning visual styles from unlabeled audio-visual data.
Our model learns to manipulate the texture of a scene to match a sound, a
problem we term audio-driven image stylization. Given a dataset of paired
audio-visual data, we learn to modify input images such that, after
manipulation, they are more likely to co-occur with a given input sound. In
quantitative and qualitative evaluations, our sound-based model outperforms
label-based approaches. We also show that audio can be an intuitive
representation for manipulating images, as adjusting a sound's volume or mixing
two sounds together results in predictable changes to visual style. Project
webpage: https://tinglok.netlify.app/files/avstyle
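
The training idea sketched in the abstract can be made concrete with a short, hedged example. The PyTorch snippet below shows one possible adversarial update in which a generator re-textures an image conditioned on a sound embedding, while a discriminator scores whether the resulting (image, sound) pair is likely to co-occur; the module interfaces, loss form, and update scheme are illustrative assumptions rather than the paper's exact architecture or objective.

```python
# Minimal sketch of audio-driven image stylization training (assumed setup,
# not the paper's exact method): a generator re-textures an image to match a
# sound, and a discriminator judges audio-visual co-occurrence.
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, audio_encoder,
                  image, waveform, opt_g, opt_d):
    """One adversarial update on a paired (image, sound) example."""
    audio_emb = audio_encoder(waveform)        # [B, D] sound embedding
    stylized = generator(image, audio_emb)     # image re-textured toward the sound

    # Discriminator: real co-occurring pair vs. generated pair.
    d_real = discriminator(image, audio_emb)
    d_fake = discriminator(stylized.detach(), audio_emb)
    loss_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: make the output plausibly co-occur with the input sound.
    loss_g = F.softplus(-discriminator(stylized, audio_emb)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Because the conditioning signal is a continuous audio embedding, lowering a waveform's volume or mixing two waveforms before encoding shifts the embedding smoothly, which is consistent with the predictable style changes the abstract describes.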
Related papers
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Generating Realistic Images from In-the-wild Sounds [2.531998650341267]
We propose a novel approach to generate images from in-the-wild sounds.
First, we convert sound into text using audio captioning.
Second, we propose audio attention and sentence attention to capture the rich characteristics of the sound and visualize it.
arXiv Detail & Related papers (2023-09-05T17:36:40Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be plugged into existing Text-to-Image (T2I) models (see the sketch after this list).
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model trained by scheduling the learning procedure of each component so as to associate the audio and visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine-translation metrics commonly used for captioning.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Sound-Guided Semantic Image Manipulation [19.01823634838526]
We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enriches the variety of possible image modifications.
The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T13:30:12Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Sounds produced outside the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
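
Several entries above (notably AAI and Sound-Guided Semantic Image Manipulation) share one mechanism: encode a sound into an embedding space that an existing text- or image-conditioned model already understands, so that a frozen generator can be reused unchanged. The snippet below is a minimal sketch of that idea; the projection head, dimensions, and contrastive alignment loss are hypothetical choices, not details taken from any of the listed papers.

```python
# Hedged sketch of the recurring "sound token" idea: project an audio
# embedding into the text-token space of a frozen text-to-image model and
# align it with paired captions. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class AudioToTokenProjector(nn.Module):
    """Maps an audio-encoder embedding into a text-token embedding space."""
    def __init__(self, audio_dim: int = 512, token_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim), nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_emb)                  # [B, token_dim]

def alignment_loss(audio_tokens, text_tokens, temperature: float = 0.07):
    """InfoNCE-style loss pulling each sound toward its paired caption token."""
    a = nn.functional.normalize(audio_tokens, dim=-1)
    t = nn.functional.normalize(text_tokens, dim=-1)
    logits = a @ t.T / temperature                   # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return nn.functional.cross_entropy(logits, targets)
```

At inference time, the projected embedding can stand in for (or be appended to) a word embedding in a frozen text-to-image model's prompt, which is the plug-and-play use the AAI summary describes.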
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.