Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
- URL: http://arxiv.org/abs/2509.21377v1
- Date: Tue, 23 Sep 2025 09:31:00 GMT
- Title: Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
- Authors: Yinfeng Yu, Hailong Zhang, Meiling Zhu
- Abstract summary: We propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
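The abstract describes the fusion design only at a high level. As a hedged illustration of what a Transformer that "filters and selectively fuses cross-modal information" might look like, the following PyTorch sketch has visual tokens query audio tokens through cross-attention and gates the result per token; the module name, dimensions, and gating design are assumptions, not the released DMTF-AVN code (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: visual tokens query audio tokens, and a
    learned sigmoid gate decides how much of the fused signal passes
    through. Shapes and structure are assumptions, not DMTF-AVN's code."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, Nv, dim) frame tokens; audio: (B, Na, dim) spectrogram tokens
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        g = self.gate(torch.cat([visual, fused], dim=-1))  # per-token gate in [0, 1]
        return self.norm(visual + g * fused)               # selective residual fusion
```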
Related papers
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
- MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion [2.7745600113170994]
We introduce the MultiSensor-Home dataset, a novel benchmark for comprehensive action recognition in home environments. We also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method.
arXiv Detail & Related papers (2025-04-03T05:23:08Z)
- STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking [8.238662377845142]
In this work, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model.
Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
arXiv Detail & Related papers (2024-10-08T12:15:17Z)
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
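One way to picture the blockwise contrastive alignment described above is an InfoNCE-style loss applied to pooled audio and visual features at several transformer blocks; a minimal sketch under assumed shapes, not the MA-AVT implementation:

```python
import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Illustrative only: each list holds one (B, D) pooled feature per
    transformer block; matched batch indices are treated as positives."""
    total = 0.0
    for a, v in zip(audio_feats, visual_feats):       # one term per block
        a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
        logits = a @ v.t() / temperature              # (B, B) similarities
        targets = torch.arange(a.size(0), device=a.device)
        total = total + 0.5 * (F.cross_entropy(logits, targets)
                               + F.cross_entropy(logits.t(), targets))
    return total / len(audio_feats)
```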
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies [8.01792778132834]
AVT2-DWF aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities.
AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions.
Experiments on DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance.
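The "dynamic weighting" in the title suggests per-sample weights over the two streams. A hedged sketch of that reading, with the scoring head and dimensions as assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn

class DynamicWeightedFusion(nn.Module):
    """Hypothetical reading of dynamic weighting: a small head predicts
    per-sample softmax weights for the audio and visual embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 2)

    def forward(self, audio, visual):
        # audio, visual: (B, dim) clip-level embeddings
        w = torch.softmax(self.scorer(torch.cat([audio, visual], dim=-1)), dim=-1)
        return w[:, :1] * audio + w[:, 1:] * visual   # convex combination
```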
arXiv Detail & Related papers (2024-03-22T06:04:37Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
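As a rough picture of cross-attention fusion in this spirit, the sketch below lets each modality query the other and pools the two views; shapes, pooling, and the symmetric design are assumptions, not the JMT architecture:

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative symmetric cross-attention: audio queries visual
    keys/values and vice versa; pooled outputs are summed."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (B, Na, dim), visual: (B, Nv, dim)
        av, _ = self.a2v(audio, visual, visual)   # audio queries visual keys
        va, _ = self.v2a(visual, audio, audio)    # visual queries audio keys
        return av.mean(dim=1) + va.mean(dim=1)    # pooled joint embedding
```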
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [64.58258341591929]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
We establish the first set of large-scale AR-MOT benchmarks.
arXiv Detail & Related papers (2024-02-28T12:50:16Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
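A "light feature adapter" is commonly realized as a bottleneck: project down, apply a non-linearity, project back up, and add residually. A generic sketch of that pattern follows (dimensions assumed, not the paper's exact module):

```python
import torch.nn as nn

class LightAdapter(nn.Module):
    """Generic bottleneck adapter sketch for transferring information
    from a source modality into a target modality's features."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x_src, x_dst):
        # x_src: source-modality features, x_dst: target-modality features
        return x_dst + self.up(self.act(self.down(x_src)))  # residual transfer
```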
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Pay Self-Attention to Audio-Visual Navigation [24.18976027602831]
We propose an end-to-end framework that learns to chase a moving audio target using a context-aware audio-visual fusion strategy.
Our thorough experiments validate the superior performance of FSAAVN in comparison with state-of-the-art methods.
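For a concrete feel of context-aware audio-visual fusion inside a navigation policy, here is a hedged sketch in which self-attention runs over concatenated audio and visual tokens and the pooled state scores discrete actions; the head layout and action set are assumptions, not the FSAAVN network:

```python
import torch
import torch.nn as nn

class AVNavPolicy(nn.Module):
    """Hypothetical policy head for audio-visual navigation."""
    def __init__(self, dim=256, heads=4, num_actions=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.actor = nn.Linear(dim, num_actions)  # e.g. forward/left/right/stop

    def forward(self, audio_tokens, visual_tokens):
        tokens = torch.cat([audio_tokens, visual_tokens], dim=1)
        ctx, _ = self.attn(tokens, tokens, tokens)  # context-aware fusion
        return self.actor(ctx.mean(dim=1))          # action logits
```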
arXiv Detail & Related papers (2022-10-04T03:42:36Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks; the visible and thermal filters then perform dynamic convolutional operations on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose a joint local and global search that exploits a new direction-aware target-driven attention mechanism.
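Dynamic filter generation can be illustrated with a per-sample depthwise kernel predicted from the input and applied via a grouped convolution; channel counts and the generator design are assumptions, not the MFGNet configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterConv(nn.Module):
    """Sketch: predict a per-sample depthwise 3x3 kernel from the input
    feature map, then apply it with a grouped convolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.gen = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(channels, channels * 9))

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.gen(x).view(b * c, 1, 3, 3)        # one 3x3 kernel per map
        y = F.conv2d(x.reshape(1, b * c, h, w), k,
                     padding=1, groups=b * c)       # each kernel hits its own map
        return y.view(b, c, h, w)
```

Per the summary above, MFGNet runs two such generators independently, one for the visible stream and one for the thermal stream.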
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
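The bottleneck idea is that a handful of shared latent tokens are the only channel through which the modalities exchange information. A hedged single-layer sketch (token counts and the averaging step are assumptions):

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Sketch: each modality self-attends together with shared bottleneck
    tokens; the updated bottlenecks from both streams are averaged, so
    cross-modal information flows only through them."""
    def __init__(self, dim=256, heads=4, num_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim))
        self.enc_a = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.enc_v = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        z = self.bottleneck.expand(audio.size(0), -1, -1)
        nb = z.size(1)
        ya = self.enc_a(torch.cat([audio, z], dim=1))   # audio + bottlenecks
        yv = self.enc_v(torch.cat([visual, z], dim=1))  # visual + bottlenecks
        z_new = 0.5 * (ya[:, -nb:] + yv[:, -nb:])       # merge updated bottlenecks
        return ya[:, :-nb], yv[:, :-nb], z_new
```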
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- HMS: Hierarchical Modality Selection for Efficient Video Recognition [69.2263841472746]
This paper introduces Hierarchical Modality Selection (HMS), a simple yet efficient multimodal learning framework for efficient video recognition.
HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on the fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis.
We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance.
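The per-input selection logic reads naturally as a gate on the cheap modality's features deciding whether the expensive pathway runs at all; a sketch with hypothetical backbone modules and threshold, not the HMS implementation:

```python
import torch
import torch.nn as nn

class HierarchicalModalitySelector(nn.Module):
    """Sketch: always compute cheap audio features; a learned gate
    decides per input whether to also run the costly visual branch.
    `audio_net`/`visual_net` are hypothetical backbones mapping inputs
    to (B, dim) features."""
    def __init__(self, audio_net, visual_net, dim=512, num_classes=200):
        super().__init__()
        self.audio_net, self.visual_net = audio_net, visual_net
        self.gate = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio, video, threshold=0.5):
        feats = self.audio_net(audio)                        # cheap, always on
        run = (torch.sigmoid(self.gate(feats)) > threshold).squeeze(-1)
        if run.any():                                        # costly, on demand
            feats = feats.clone()
            feats[run] = feats[run] + self.visual_net(video[run])
        return self.head(feats)
```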
arXiv Detail & Related papers (2021-04-20T04:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.