Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation
- URL: http://arxiv.org/abs/2511.22025v1
- Date: Thu, 27 Nov 2025 02:00:28 GMT
- Title: Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation
- Authors: Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte
- Abstract summary: We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions. To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
- Score: 65.7990140284317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.
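To make the "direct flight" alternative concrete, below is a minimal sketch of direct audio grounding: a spoken-word embedding is matched against dense image features to produce a soft mask, with no ASR or keyword-extraction stage in between. The encoders, dimensions, and module names are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch of direct audio-visual grounding (no text intermediate).
# All modules are stand-ins for pretrained audio/image encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectAudioGrounder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Hypothetical audio encoder over log-mel frames and a patch-feature image backbone.
        self.audio_enc = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.image_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, mel, image):
        # mel: (B, T, 128) log-mel frames; image: (B, 3, H, W)
        a = F.normalize(self.audio_enc(mel).mean(dim=1), dim=-1)  # (B, D) spoken-word embedding
        v = F.normalize(self.image_enc(image), dim=1)             # (B, D, h, w) patch features
        # Cosine similarity between the spoken word and every patch -> soft grounding mask.
        logits = torch.einsum("bd,bdhw->bhw", a, v)
        return logits.sigmoid()

model = DirectAudioGrounder()
mask = model(torch.randn(2, 50, 128), torch.randn(2, 3, 224, 224))
print(mask.shape)  # torch.Size([2, 14, 14])
```

A transcription-based pipeline would insert an ASR model and a keyword extractor before the visual matching step; the point of the comparison above is that the audio embedding can query the image directly.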
Related papers
- Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation [30.42124709340273]
We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations.
arXiv Detail & Related papers (2025-11-20T19:17:35Z) - SeeingSounds: Learning Audio-to-Visual Alignment via Text [15.011814561603964]
We introduce SeeingSounds, a framework for audio-to-image generation that leverages the interplay between audio, language, and vision. Our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder and contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception.
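A hedged sketch of the dual-alignment idea described above: a small trainable projector maps pooled audio features into the embedding space of a frozen language encoder and is trained with a cosine objective. The dimensions and the exact loss are assumptions, not the paper's specification.

```python
# Sketch: align audio features to a frozen language-encoder space.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_dim, text_dim = 768, 512
projector = nn.Linear(audio_dim, text_dim)  # the only trainable piece in this sketch

def alignment_loss(audio_feat, frozen_text_emb):
    # audio_feat: (B, audio_dim) pooled audio features
    # frozen_text_emb: (B, text_dim) embeddings from a frozen language encoder
    a = F.normalize(projector(audio_feat), dim=-1)
    t = F.normalize(frozen_text_emb, dim=-1)
    return (1 - (a * t).sum(dim=-1)).mean()  # 1 - cosine similarity

loss = alignment_loss(torch.randn(4, audio_dim), torch.randn(4, text_dim))
loss.backward()
```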
arXiv Detail & Related papers (2025-10-10T18:42:50Z) - Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z) - You Only Speak Once to See [24.889319740761827]
Grounding objects in images using visual cues is a well-established approach in computer vision.
We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes.
Experimental results indicate that audio guidance can be effectively applied to object grounding.
arXiv Detail & Related papers (2024-09-27T01:16:15Z) - Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
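One simple way to picture "text cues enhancing audio guidance" is a weighted fusion of the two modality embeddings before they query the visual features; this toy sketch is an illustration of the general idea, not the paper's design, and the fusion weight is an assumption.

```python
# Toy sketch: blend a weak audio query with a text cue from the scene.
import torch
import torch.nn.functional as F

def fused_query(audio_emb, text_emb, alpha=0.5):
    # audio_emb, text_emb: (B, D) embeddings; alpha trades off the two cues.
    q = alpha * F.normalize(audio_emb, dim=-1) + (1 - alpha) * F.normalize(text_emb, dim=-1)
    return F.normalize(q, dim=-1)

q = fused_query(torch.randn(2, 256), torch.randn(2, 256))
print(q.shape)  # torch.Size([2, 256])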
arXiv Detail & Related papers (2024-07-15T17:45:20Z) - Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
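At the core of such dual-encoder grounding is a dense audio-visual similarity volume: every audio frame is compared against every image patch, so word "meaning" and sound "location" can emerge from where similarity peaks. The sketch below shows that computation with illustrative shapes; it is not DenseAV's full architecture.

```python
# Sketch: dense per-frame, per-patch audio-visual similarity.
import torch
import torch.nn.functional as F

B, T, H, W, D = 2, 40, 14, 14, 256
audio = F.normalize(torch.randn(B, T, D), dim=-1)     # per-frame audio features
visual = F.normalize(torch.randn(B, D, H, W), dim=1)  # per-patch visual features

sim = torch.einsum("btd,bdhw->bthw", audio, visual)   # (B, T, H, W) similarity volume
heatmap = sim.max(dim=1).values                       # spatial peak over the whole utterance
print(heatmap.shape)  # torch.Size([2, 14, 14])
```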
arXiv Detail & Related papers (2024-06-09T03:38:21Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
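A hedged reading of "trimodal consistency": embeddings of the language description, the visual object, and the separated audio are pulled toward one another pairwise. The pairwise-cosine form below is an assumption for illustration, not the paper's two novel loss functions.

```python
# Sketch: pull text, visual, and audio embeddings toward pairwise agreement.
import torch
import torch.nn.functional as F

def trimodal_consistency(text_e, vis_e, aud_e):
    t, v, a = (F.normalize(x, dim=-1) for x in (text_e, vis_e, aud_e))
    cos = lambda x, y: (x * y).sum(dim=-1)
    return (3 - cos(t, v) - cos(v, a) - cos(a, t)).mean()

loss = trimodal_consistency(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(float(loss))
```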
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
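To make the "contrastive" category concrete, here is an InfoNCE-style loss in the spirit of CPC/wav2vec-style pretraining: a predicted context vector must pick out its true target frame among in-batch negatives. Shapes and the temperature value are illustrative assumptions.

```python
# Sketch: InfoNCE over a batch, positives on the diagonal.
import torch
import torch.nn.functional as F

def info_nce(context, targets, temperature=0.1):
    # context: (B, D) predicted representations; targets: (B, D) true frames.
    logits = F.normalize(context, dim=-1) @ F.normalize(targets, dim=-1).T
    labels = torch.arange(context.size(0))  # each row's positive is its own target
    return F.cross_entropy(logits / temperature, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```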
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)