Related papers: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

URL: http://arxiv.org/abs/2506.05414v1
Date: Wed, 04 Jun 2025 19:11:20 GMT
Title: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
Authors: Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman,
Abstract summary: 3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition.<n> SAVVY is the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio.
Score: 17.185628958975528
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.

Related papers

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world.<n>underlinetextbf3D vision-language learning enables embodied agents to effectively explore and understand their environment.<n>model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction [53.19968902152528]
We present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion.<n>Specifically, our method learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps.<n>We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks.
arXiv Detail & Related papers (2025-04-08T05:33:13Z)
EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding.<n>EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z)
Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations [37.61183525419993]
We propose CDScene: Vision-based Robust Semantic Scene Completion via Capturing Dynamic Representations.<n>We leverage a multimodal large-scale model to extract 2D explicit semantics and align them into 3D space.<n>We exploit the characteristics of monocular and stereo depth to decouple scene information into dynamic and static features.
arXiv Detail & Related papers (2025-03-08T13:49:43Z)
3D Audio-Visual Segmentation [44.61476023587931]
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. We propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models. Experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI.
arXiv Detail & Related papers (2024-11-04T16:30:14Z)
Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.<n>voxelization infers per-object occupancy probabilities at individual spatial locations.<n>Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation [44.940531391847]
We address the challenge of dense indoor prediction with sound in 2D and 3D via cross-modal knowledge distillation. We are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. For audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-20T06:07:04Z)
Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation. This framework establishes robust correlations between an object's visual characteristics and its associated sound. We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization [67.85434518679382]
We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene. voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
arXiv Detail & Related papers (2023-04-30T05:29:28Z)
Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source. Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.