Related papers: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

URL: http://arxiv.org/abs/2406.05629v1
Date: Sun, 9 Jun 2024 03:38:21 GMT
Title: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Authors: Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
Abstract summary: We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
Score: 77.33458847943528
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
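
Illustrative sketch (not from the paper): the abstract describes a multi-head aggregation operator that compares dense image and audio features directly. The pure-Python example below shows one way such an operator could be written. The pooling choices (maximum over heads and image locations, mean over audio frames), the function names, and the list-of-vectors feature layout are assumptions made for illustration and are not stated in the abstract.

```python
from typing import List, Sequence

Vector = Sequence[float]


def head_slices(vec: Vector, num_heads: int) -> List[Vector]:
    """Split one feature vector into num_heads equal-sized channel groups."""
    size = len(vec) // num_heads
    return [vec[k * size:(k + 1) * size] for k in range(num_heads)]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_similarity(
    audio_feats: Sequence[Vector],   # one vector per audio frame
    image_feats: Sequence[Vector],   # one vector per image location
    num_heads: int,
) -> float:
    """Reduce all frame-by-location-by-head similarities to one score.

    Assumed pooling (hypothetical, for illustration only):
    max over heads and image locations, then mean over audio frames.
    """
    channels = len(audio_feats[0])
    if channels % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")

    image_heads = [head_slices(v, num_heads) for v in image_feats]
    per_frame = []
    for frame in audio_feats:
        frame_heads = head_slices(frame, num_heads)
        # Best-matching (head, location) pair for this audio frame.
        best = max(
            dot(frame_heads[k], loc[k])
            for loc in image_heads
            for k in range(num_heads)
        )
        per_frame.append(best)
    return sum(per_frame) / len(per_frame)


# Toy features: 2 audio frames, 2 image locations, 4 channels, 2 heads.
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
image = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
score = aggregate_similarity(audio, image, num_heads=2)
```

In a contrastive setup, a score like this would be computed for every audio-image pair in a batch, with matched pairs trained to score higher than mismatched ones. Because the score is built from per-location similarities, the intermediate values can also be read as a coarse localization map.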

Related papers

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z)
Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. We introduce a unified audio-visual few-shot video classification benchmark on three datasets. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data. In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z)
AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z)
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech. We present AV-TranSpeech, the first audio-visual speech-to-speech model without relying on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
Audio-Visual Segmentation with Semantics [45.5917563087477]
We propose a new problem called audio-visual segmentation (AVS). The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. We construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos.
arXiv Detail & Related papers (2023-01-30T18:53:32Z)
Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS). The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
Localizing Visual Sounds the Hard Way [149.84890978170174]
We train the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. We introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset.
arXiv Detail & Related papers (2021-04-06T17:38:18Z)
Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs. We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)
Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.