LISA: Localized Image Stylization with Audio via Implicit Neural
Representation
- URL: http://arxiv.org/abs/2211.11381v1
- Date: Mon, 21 Nov 2022 11:51:48 GMT
- Title: LISA: Localized Image Stylization with Audio via Implicit Neural
Representation
- Authors: Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim,
Sangpil Kim
- Abstract summary: We present a novel framework, Localized Image Stylization with Audio (LISA),
which performs audio-driven localized image stylization.
We show that the proposed framework outperforms other audio-guided stylization methods.
- Score: 17.672008998994816
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a novel framework, Localized Image Stylization with Audio (LISA),
which performs audio-driven localized image stylization. Sound often provides
information about the specific context of the scene and is closely related to a
certain part of the scene or object. However, existing image stylization works
have focused on stylizing the entire image using an image or text input.
Stylizing a particular part of the image based on audio input is natural but
challenging. In this work, we propose a framework in which a user provides one
audio input to localize the sound source in the input image and another to
locally stylize the target object or scene. LISA first produces a delicate
localization map with an audio-visual localization network by leveraging CLIP
embedding space. We then utilize implicit neural representation (INR) along
with the predicted localization map to stylize the target object or scene based
on sound information. The proposed INR can manipulate the localized pixel
values to be semantically consistent with the provided audio input. Through a
series of experiments, we show that the proposed framework outperforms
other audio-guided stylization methods. Moreover, LISA constructs concise
localization maps and naturally manipulates the target object or scene in
accordance with the given audio input.
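The abstract describes a two-stage pipeline: an audio-visual localization network that leverages the CLIP embedding space to produce a localization map, and an implicit neural representation (INR) that repaints only the localized pixels according to the audio. The PyTorch sketch below illustrates that idea under stated assumptions; the function and class names (localize_from_audio, CoordINR, stylize_locally), the feature shapes, and the sigmoid/temperature choice are hypothetical and are not taken from the paper or its code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def localize_from_audio(patch_feats: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Soft localization map from cosine similarity in a shared (CLIP-like) space.

    patch_feats: (H, W, D) per-patch image features already projected into the
    shared embedding space; audio_emb: (D,) audio embedding in the same space.
    Returns an (H, W) map in [0, 1].
    """
    sim = F.cosine_similarity(patch_feats, audio_emb.view(1, 1, -1), dim=-1)
    return torch.sigmoid(sim / 0.07)  # the temperature 0.07 is an illustrative choice


class CoordINR(nn.Module):
    """Tiny MLP (implicit neural representation): (x, y) + audio embedding -> RGB."""

    def __init__(self, audio_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2); the single audio embedding is shared by every pixel.
        audio = audio_emb.expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, audio], dim=-1))


def stylize_locally(image: torch.Tensor, loc_map: torch.Tensor,
                    audio_emb: torch.Tensor, inr: CoordINR) -> torch.Tensor:
    """Blend INR-predicted colors into `image` (3, H, W) where `loc_map` (H, W) is high."""
    _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)              # (H*W, 2)
    styled = inr(coords, audio_emb).reshape(h, w, 3).permute(2, 0, 1)  # (3, H, W)
    mask = loc_map.unsqueeze(0)                                        # (1, H, W)
    return mask * styled + (1.0 - mask) * image                        # pixels outside the map stay untouched


if __name__ == "__main__":
    img = torch.rand(3, 64, 64)       # dummy image
    feats = torch.randn(64, 64, 512)  # dummy per-patch features in the shared space
    emb = torch.randn(512)            # dummy audio embedding
    loc = localize_from_audio(feats, emb)
    out = stylize_locally(img, loc, emb, CoordINR())
    print(out.shape)                  # torch.Size([3, 64, 64])
```

In the paper itself, the localization network and the INR are trained with CLIP-based objectives; the sketch only shows how a soft localization map can gate where an audio-conditioned INR repaints the image.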
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio
context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Can CLIP Help Sound Source Localization? [19.370071553914954]
We introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder.
By directly using these embeddings, our method generates audio-grounded masks for the provided audio.
Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects.
arXiv Detail & Related papers (2023-11-07T15:26:57Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.