Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
- URL: http://arxiv.org/abs/2301.02184v2
- Date: Thu, 20 Apr 2023 05:26:28 GMT
- Title: Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
- Authors: Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul
Calamia, Kristen Grauman, Vamsi Krishna Ithapu
- Abstract summary: We build the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation.
We present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space.
Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff.
- Score: 65.37621891132729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal
the map of a scene in a cost-efficient way? We seek to answer this question by
proposing a new problem: efficiently building the map of a previously unseen 3D
environment by exploiting shared information in the egocentric audio-visual
observations of participants in a natural conversation. Our hypothesis is that
as multiple people ("egos") move in a scene and talk among themselves, they
receive rich audio-visual cues that can help uncover the unseen areas of the
scene. Given the high cost of continuously processing egocentric visual
streams, we further explore how to actively coordinate the sampling of visual
information, so as to minimize redundancy and reduce power use. To that end, we
present an audio-visual deep reinforcement learning approach that works with
our shared scene mapper to selectively turn on the camera to efficiently chart
out the space. We evaluate the approach using a state-of-the-art audio-visual
simulator for 3D scenes as well as real-world video. Our model outperforms
previous state-of-the-art mapping methods, and achieves an excellent
cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map.
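
The abstract describes the approach only at a high level: each ego always hears the conversation cheaply, and a learned policy decides when it is worth paying the cost of also processing a camera frame for the shared mapper. The following is a minimal sketch of that cost-aware sampling idea, not the authors' code; the module names, feature sizes, and the cost weight are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of the core idea:
# a policy consumes cheap audio features at every step and decides whether
# to also process the camera frame; the reward trades mapping-accuracy gain
# against a per-frame visual cost. All dimensions are assumed for illustration.

import torch
import torch.nn as nn


class CameraPolicy(nn.Module):
    """Scores the binary 'turn the camera on?' decision for one ego."""

    def __init__(self, audio_dim=128, state_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # logits for {skip frame, use frame}
        )

    def forward(self, audio_feat, map_state):
        return self.net(torch.cat([audio_feat, map_state], dim=-1))


def step_reward(map_acc_new, map_acc_old, used_camera, visual_cost=0.05):
    """Hypothetical reward: accuracy improvement minus a cost for each frame used."""
    return (map_acc_new - map_acc_old) - visual_cost * float(used_camera)


if __name__ == "__main__":
    policy = CameraPolicy()
    audio_feat = torch.randn(1, 128)   # stands in for features of the egos' speech
    map_state = torch.randn(1, 256)    # stands in for the shared mapper's current state
    logits = policy(audio_feat, map_state)
    use_frame = bool(torch.argmax(logits, dim=-1).item())
    print("use camera this step:", use_frame)
    print("reward example:", step_reward(0.62, 0.60, use_frame))
```

The binary per-step, per-ego action and the accuracy-minus-cost reward mirror the cost-accuracy tradeoff the abstract emphasizes; in the paper this decision is trained with deep reinforcement learning together with the shared scene mapper.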
Related papers
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representations.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cue integration method is proposed for the VAP task, exploring the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Telling Left from Right: Learning Spatial Correspondence of Sight and Sound [16.99266133458188]
We propose a novel self-supervised task to leverage a principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream.
We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams.
We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines.
arXiv Detail & Related papers (2020-06-11T04:00:24Z)
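
Among the related papers, the pretext task in "Telling Left from Right" is simple enough to sketch concretely: randomly swap the left and right audio channels and train a classifier to detect the flip from the paired video and audio. The sketch below is an assumed, minimal formulation; the tensor shapes and the small fusion network are illustrative and not the paper's architecture.

```python
# Rough sketch of the channel-flip pretext task: swap L/R audio channels at
# random and predict whether a flip occurred from the (video, audio) pair.
# Shapes and the fusion network are illustrative assumptions.

import torch
import torch.nn as nn


def maybe_flip_channels(stereo_audio):
    """stereo_audio: (batch, 2, time). Returns augmented audio and flip labels."""
    flip = torch.randint(0, 2, (stereo_audio.shape[0],))   # 1 = channels flipped
    flipped = torch.flip(stereo_audio, dims=[1])            # swap left and right
    out = torch.where(flip.view(-1, 1, 1).bool(), flipped, stereo_audio)
    return out, flip


class FlipClassifier(nn.Module):
    """Tiny audio-visual fusion network that predicts the flip label."""

    def __init__(self, audio_dim=64, video_dim=64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(audio_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(video_dim), nn.ReLU())
        self.head = nn.Linear(audio_dim + video_dim, 2)

    def forward(self, audio, video):
        return self.head(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1))


if __name__ == "__main__":
    audio = torch.randn(4, 2, 1000)        # stereo waveforms
    video = torch.randn(4, 3, 16, 32, 32)  # small video clips
    aug_audio, labels = maybe_flip_channels(audio)
    logits = FlipClassifier()(aug_audio, video)
    loss = nn.CrossEntropyLoss()(logits, labels)
    print("pretext loss:", float(loss))
```

Solving this task requires relating spatial cues in the binaural audio to the positions of sound sources in the frame, which is the spatial correspondence that paper argues transfers to downstream audio-visual tasks.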
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.