AcousticFusion: Fusing Sound Source Localization to Visual SLAM in
Dynamic Environments
- URL: http://arxiv.org/abs/2108.01246v1
- Date: Tue, 3 Aug 2021 02:10:26 GMT
- Title: AcousticFusion: Fusing Sound Source Localization to Visual SLAM in
Dynamic Environments
- Authors: Tianwei Zhang, Huayan Zhang, Xiaofei Li, Junfeng Chen, Tin Lun Lam and
Sethu Vijayakumar
- Abstract summary: We propose a novel audio-visual fusion approach that fuses the sound source direction into the RGB-D image.
The proposed method requires very little computation and yields stable self-localization results.
- Score: 19.413143126734383
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dynamic objects in the environment, such as people and other
agents, pose challenges for existing simultaneous localization and mapping
(SLAM) approaches. To deal with dynamic environments, computer vision
researchers usually apply learning-based object detectors to remove these
dynamic objects. However, such detectors are computationally too expensive
for on-board processing on a mobile robot. In practical applications, these
dynamic objects emit sound that can be effectively detected by on-board
sound source localization. The direction of a sound source can be obtained
efficiently by direction-of-arrival (DoA) estimation, but its depth is
difficult to estimate. Therefore, in this paper, we propose a novel
audio-visual fusion approach that fuses the sound source direction into the
RGB-D image and thus removes the effect of dynamic obstacles on the
multi-robot SLAM system. Experimental results of multi-robot SLAM in
different dynamic environments show that the proposed method requires very
little computation while producing stable self-localization results.
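The fusion step described above can be pictured as projecting the estimated sound direction onto the camera frame and discarding depth measurements in that angular sector before the SLAM front end consumes them. Below is a minimal Python sketch of such a step under an assumed pinhole camera whose optical axis is aligned with the microphone array's zero-azimuth direction; the function names, the fixed angular margin, and the column-masking strategy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def doa_to_pixel_columns(azimuth_rad, image_width, horizontal_fov_rad, margin_rad=0.15):
    """Map a sound-source azimuth (radians, 0 = optical axis, positive to the
    right) to a [col_min, col_max] pixel-column range. Assumes the microphone
    array and camera share the same yaw reference (an assumption of this sketch)."""
    fx = (image_width / 2.0) / np.tan(horizontal_fov_rad / 2.0)  # focal length in pixels
    cx = image_width / 2.0
    u_lo = cx + fx * np.tan(azimuth_rad - margin_rad)
    u_hi = cx + fx * np.tan(azimuth_rad + margin_rad)
    col_min = int(np.clip(min(u_lo, u_hi), 0, image_width - 1))
    col_max = int(np.clip(max(u_lo, u_hi), 0, image_width - 1))
    return col_min, col_max

def mask_dynamic_sector(depth, azimuth_rad, horizontal_fov_rad):
    """Invalidate depth pixels inside the angular sector containing the detected
    (sound-emitting) dynamic object, so the SLAM front end ignores them."""
    _, w = depth.shape
    col_min, col_max = doa_to_pixel_columns(azimuth_rad, w, horizontal_fov_rad)
    masked = depth.copy()
    masked[:, col_min:col_max + 1] = 0.0  # 0 treated here as "no measurement"
    return masked

# Example: a person talking roughly 20 degrees to the right of the optical axis.
depth = np.random.uniform(0.5, 5.0, size=(480, 640)).astype(np.float32)
filtered = mask_dynamic_sector(depth, np.deg2rad(20.0), np.deg2rad(69.4))
# `filtered` would then be handed to the RGB-D SLAM front end in place of `depth`.
```

Because DoA estimation gives only direction and no range, the sketch masks an entire vertical sector of the depth image rather than a bounded 3D region; the paper's actual fusion into the multi-robot SLAM pipeline is more involved.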
Related papers
- V3D-SLAM: Robust RGB-D SLAM in Dynamic Environments with 3D Semantic Geometry Voting [1.3493547928462395]
Simultaneous localization and mapping (SLAM) in highly dynamic environments is challenging due to the correlation between moving objects and the camera pose.
We propose a robust method, V3D-SLAM, to remove moving objects via two lightweight re-evaluation stages.
Our experiments on dynamic sequences of the TUM RGB-D benchmark with ground-truth camera trajectories show that our method outperforms the most recent state-of-the-art SLAM methods.
arXiv Detail & Related papers (2024-10-15T21:08:08Z)
- ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling [57.1025908604556]
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment.
We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment.
We introduce ActiveRIR, a reinforcement learning policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions.
arXiv Detail & Related papers (2024-04-24T21:30:01Z)
- DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM [5.267859554944985]
We introduce DDN-SLAM, the first real-time dense dynamic neural implicit SLAM system integrating semantic features.
Compared to existing neural implicit SLAM systems, the tracking results on dynamic datasets indicate an average 90% improvement in Average Trajectory Error (ATE) accuracy.
arXiv Detail & Related papers (2024-01-03T05:42:17Z)
- Graphical Object-Centric Actor-Critic [55.2480439325792]
We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches.
We use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment.
Our algorithm outperforms the state-of-the-art model-free actor-critic algorithm in a visually complex 3D robotic environment and in a 2D environment with compositional structure.
arXiv Detail & Related papers (2023-10-26T06:05:12Z)
- Language-Conditioned Observation Models for Visual Object Search [12.498575839909334]
We bridge the gap in realistic object search by posing the problem as a partially observable Markov decision process (POMDP).
We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise.
We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
arXiv Detail & Related papers (2023-09-13T19:30:53Z)
- Latent Exploration for Reinforcement Learning [87.42776741119653]
In Reinforcement Learning, agents learn policies by exploring and interacting with the environment.
We propose LATent TIme-Correlated Exploration (Lattice), a method to inject temporally correlated noise into the latent state of the policy network (a generic sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-05-31T17:40:43Z)
- Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation [68.56443382421878]
We propose a novel adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is simultaneously converted into depth and static saliency sources.
Experiments show that the proposed model outperforms the state-of-the-art methods on three challenging ZVOS benchmarks.
arXiv Detail & Related papers (2023-03-18T10:19:29Z)
- D2SLAM: Semantic visual SLAM based on the influence of Depth for Dynamic environments [0.483420384410068]
We propose a novel approach to determine dynamic elements that lack generalization and scene awareness.
We use scene depth information to refine the accuracy of estimates from the geometric and semantic modules.
The obtained results demonstrate the efficacy of the proposed method in providing accurate localization and mapping in dynamic environments.
arXiv Detail & Related papers (2022-10-16T22:13:59Z)
- Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
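As noted in the Latent Exploration (Lattice) entry above, exploration noise can be injected into the latent state of a policy network rather than directly into the actions. The following is a generic Python sketch of temporally correlated (Ornstein-Uhlenbeck-style) latent noise; the tiny two-layer policy, the noise parameters, and the helper names are illustrative assumptions and do not reproduce the Lattice algorithm itself.

```python
import numpy as np

class CorrelatedLatentNoise:
    """Ornstein-Uhlenbeck-style noise: successive samples are correlated in
    time, unlike i.i.d. Gaussian noise drawn independently at every step."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        drift = -self.theta * self.state * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state

def policy_with_latent_noise(obs, w1, w2, noise):
    """Tiny two-layer policy; the noise is added to the *latent* activation,
    so the perturbation propagates through the output layer and yields
    temporally consistent exploration behaviour."""
    latent = np.tanh(obs @ w1)
    latent = latent + noise.sample()   # correlated perturbation of the hidden state
    return np.tanh(latent @ w2)        # action in [-1, 1]

# Example rollout with a random policy on 8-dim observations and 2-dim actions.
rng = np.random.default_rng(1)
w1, w2 = rng.normal(size=(8, 32)) * 0.1, rng.normal(size=(32, 2)) * 0.1
noise = CorrelatedLatentNoise(dim=32)
for _ in range(5):
    action = policy_with_latent_noise(rng.normal(size=8), w1, w2, noise)
```

Correlating the noise over time makes consecutive perturbations consistent, which tends to produce smoother exploratory behaviour than independent Gaussian noise applied to the actions.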
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.