Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal
Distillation
- URL: http://arxiv.org/abs/2309.11081v1
- Date: Wed, 20 Sep 2023 06:07:04 GMT
- Authors: Heeseung Yun, Joonil Na, Gunhee Kim
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound can convey significant information for spatial reasoning in our daily
lives. To endow deep networks with such ability, we address the challenge of
dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge
distillation. In this work, we propose a Spatial Alignment via Matching (SAM)
distillation framework that elicits local correspondence between the two
modalities in vision-to-audio knowledge transfer. SAM integrates audio features
with visually coherent learnable spatial embeddings to resolve inconsistencies
in multiple layers of a student model. Our approach does not rely on a specific
input representation, allowing for flexibility in the input shapes or
dimensions without performance degradation. With a newly curated benchmark
named Dense Auditory Prediction of Surroundings (DAPS), we are the first to
tackle dense indoor prediction of omnidirectional surroundings in both 2D and
3D with audio observations. Specifically, for audio-based depth estimation,
semantic segmentation, and challenging 3D scene reconstruction, the proposed
distillation framework consistently achieves state-of-the-art performance
across various metrics and backbone architectures.
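As a rough illustration only (not the authors' code), the spatial-alignment idea described in the abstract can be sketched as: offset each audio-derived student feature location by a learnable spatial embedding, match it to the most similar visual teacher location, and regress toward the matched feature. Every name and design choice here (`sam_distill_loss`, cosine-similarity matching, an MSE objective) is a hypothetical simplification of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
C, HW = 8, 16  # feature channels, flattened spatial locations

teacher = rng.normal(size=(C, HW))        # visual teacher features (frozen)
student = rng.normal(size=(C, HW))        # audio student features
embed = rng.normal(size=(C, HW)) * 0.01   # learnable spatial embeddings

def l2norm(x):
    # Normalize each spatial location's feature vector to unit length.
    return x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-8)

def sam_distill_loss(student, embed, teacher):
    # Add learnable spatial embeddings to the audio features.
    aligned = student + embed
    # Cosine similarity between every student location and every teacher location.
    sim = l2norm(aligned).T @ l2norm(teacher)   # shape (HW, HW)
    # Match each student location to its most similar teacher location.
    match = sim.argmax(axis=1)
    # Distill: pull the aligned student features toward the matched teacher features.
    diff = aligned - teacher[:, match]
    return float(np.mean(diff ** 2))

loss = sam_distill_loss(student, embed, teacher)
```

In a real training loop the embeddings (and student) would be optimized by gradient descent against this loss at multiple layers; the toy version above only evaluates the objective once.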
Related papers
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion [45.171150395915056]
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations.
Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations.
We resort to stereo matching and bird's-eye-view (BEV) representation learning to address these issues in SSC.
arXiv Detail & Related papers (2023-03-24T12:33:44Z)
- SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z)
- DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are attractive because they are cheaper and easier to deploy than LiDAR sensors.
We revisit DSGN, a prior stereo-based model, and its stereo volume construction for representing both 3D geometry and semantics.
We propose DSGN++, which improves information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z)
- Semantic Dense Reconstruction with Consistent Scene Segments [33.0310121044956]
A method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks.
First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone.
A dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence.
arXiv Detail & Related papers (2021-09-30T03:01:17Z)
- HIDA: Towards Holistic Indoor Understanding for the Visually Impaired
via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor [25.206941504935685]
HIDA is a lightweight assistive system based on 3D point cloud instance segmentation with a solid-state LiDAR sensor.
Our entire system consists of three hardware components, two interactive functions (obstacle avoidance and object finding), and a voice user interface.
The proposed 3D instance segmentation model achieves state-of-the-art performance on the ScanNet v2 dataset.
arXiv Detail & Related papers (2021-07-07T12:23:53Z)
- Stereo Object Matching Network [78.35697025102334]
This paper presents a stereo object matching method that exploits both 2D contextual information from images and 3D object-level information.
We present two novel strategies to handle 3D objectness in the cost volume space: selective sampling (RoISelect) and 2D-3D fusion.
arXiv Detail & Related papers (2021-03-23T12:54:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.