Related papers: You Only Speak Once to See

You Only Speak Once to See

URL: http://arxiv.org/abs/2409.18372v2
Date: Mon, 30 Sep 2024 10:08:15 GMT
Title: You Only Speak Once to See
Authors: Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei Li,
Abstract summary: Grounding objects in images using visual cues is a well-established approach in computer vision. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes. Experimental results indicate that audio guidance can be effectively applied to object grounding.
Score: 24.889319740761827
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.

Related papers

Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions.<n>To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions.<n>Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z)
Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding [5.149468267417653]
3D Visual Grounding involves localizing target objects in 3D point clouds based on natural language.<n>We propose Audio-3DVG, a framework that integrates audio and spatial information for enhanced grounding.
arXiv Detail & Related papers (2025-07-01T11:08:22Z)
AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact. Much of the knowledge is hidden and beyond the image content with the supervised labels from a limited training set. We make an attempt to improve the generalization capability of the current affordance grounding by taking the advantage of the rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF. We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Our idea is to learn to dereverberate speech from audio-visual observations. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio) Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
A proto-object based audiovisual saliency map [0.0]
We develop a proto-object based audiovisual saliency map (AVSM) for analysis of dynamic natural scenes. Such environment can be useful in surveillance, robotic navigation, video compression and related applications.
arXiv Detail & Related papers (2020-03-15T08:34:35Z)
Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.