DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction
- URL: http://arxiv.org/abs/2504.10070v2
- Date: Wed, 16 Apr 2025 06:02:02 GMT
- Title: DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction
- Authors: Kiana Hooshanfar, Alireza Hosseini, Ahmad Kalhor, Babak Nadjar Araabi,
- Abstract summary: Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos.<n>We propose Dynamic Token Fusion Saliency (DFTSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency.
- Score: 5.13730975608994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have significantly advanced, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DFTSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Both visual and audio features are integrated using our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DFTSal achieves SOTA performance while maintaining computational efficiency.
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.<n>Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task.<n>Our framework incorporates two key components for video understanding and cross-modal learning.<n>Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z) - Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network dubbedSP.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
arXiv Detail & Related papers (2024-11-18T10:42:27Z) - Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Audio-Visual Events (DAVE)<n>LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features via leveraging cross-modal local-correlated properties.<n>We further customize Cross-modal Dynamic Perception layer (CDP) in cross-modal feature pyramid to understand local temporal patterns of audio-visual events.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [7.864898315909104]
A challenging problem known as Audio-Visual (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene.<n>We introduce a novel Progressive Confident Masking Attention Network (PMCANet)<n>It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We address training early fusion architectures by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion.
We propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions.
arXiv Detail & Related papers (2023-12-02T03:38:49Z) - Multi-Resolution Audio-Visual Feature Fusion for Temporal Action
Localization [8.633822294082943]
This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF)
MRAV-FF is an innovative method to merge audio-visual data across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in parameter harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.