Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
- URL: http://arxiv.org/abs/2506.01365v1
- Date: Mon, 02 Jun 2025 06:47:42 GMT
- Title: Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
- Authors: Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik
- Abstract summary: This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention.
- Score: 2.403252956256118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
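The abstract names three fusion strategies but the listing carries no code, so here is a minimal PyTorch-style sketch of how concatenation, addition, and cross-attention fusion of MFCC and PTM frame features could look. The class name FusionVADSketch, the feature dimensions, and the assumption that the two streams are time-aligned are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the three fusion strategies described in the abstract
# (concatenation, addition, cross-attention). Names and dimensions are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn


class FusionVADSketch(nn.Module):
    def __init__(self, mfcc_dim=39, ptm_dim=768, hidden_dim=256, fusion="add"):
        super().__init__()
        self.fusion = fusion
        # Project both feature streams to a shared dimension before fusing.
        self.mfcc_proj = nn.Linear(mfcc_dim, hidden_dim)
        self.ptm_proj = nn.Linear(ptm_dim, hidden_dim)
        if fusion == "cross_attention":
            # MFCC frames attend to PTM frames (one possible CA arrangement).
            self.cross_attn = nn.MultiheadAttention(
                hidden_dim, num_heads=4, batch_first=True)
        in_dim = 2 * hidden_dim if fusion == "concat" else hidden_dim
        # Frame-level speech / non-speech classifier.
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, mfcc, ptm):
        # mfcc: (batch, frames, mfcc_dim); ptm: (batch, frames, ptm_dim),
        # assumed here to be aligned to the same frame rate.
        m = self.mfcc_proj(mfcc)
        p = self.ptm_proj(ptm)
        if self.fusion == "concat":
            fused = torch.cat([m, p], dim=-1)
        elif self.fusion == "add":
            fused = m + p
        elif self.fusion == "cross_attention":
            fused, _ = self.cross_attn(query=m, key=p, value=p)
        else:
            raise ValueError(f"unknown fusion: {self.fusion}")
        # Per-frame speech probability.
        return torch.sigmoid(self.classifier(fused))


# Example: per-frame speech probabilities for a small random batch.
model = FusionVADSketch(fusion="add")
probs = model(torch.randn(2, 100, 39), torch.randn(2, 100, 768))
print(probs.shape)  # torch.Size([2, 100, 1])
```

Under this sketch, the addition variant that the abstract reports as best avoids both the doubled classifier input of concatenation and the attention computation over frames, which is consistent with the paper's accuracy-plus-efficiency claim.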
Related papers
- LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection [4.2649265429416445]
We propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet). Our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively.
arXiv Detail & Related papers (2025-06-26T05:32:33Z) - MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment [5.922172844641853]
This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance.
arXiv Detail & Related papers (2025-06-12T07:32:51Z) - MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition [2.719872133434811]
This paper introduces the Multi-Prior Fusion Network (MPFNet), leveraging a progressive training strategy to optimize affective computing tasks. Inspired by developmental psychology, we present two variants of MPFNet, MPFNet-P and MPFNet-C, corresponding to two fundamental modes of infant cognition: parallel and hierarchical processing.
arXiv Detail & Related papers (2025-06-11T13:39:41Z) - TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition [5.9931594640934325]
Cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. We propose an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). The experimental results show that TACFN brings a significant performance improvement compared to other methods.
arXiv Detail & Related papers (2025-05-10T06:57:58Z) - Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation [7.627299398469962]
We propose a new Spectrum-based Modality Representation graph recommender. It aims to capture both uni-modal and fusion preferences while simultaneously suppressing modality noise. Experiments on three real-world datasets show the efficacy of our proposed model.
arXiv Detail & Related papers (2024-12-19T15:53:21Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning approach and fully exploiting the complementarity across features, our method achieves both high efficiency and high accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z) - Multi-scale Quaternion CNN and BiGRU with Cross Self-attention Feature Fusion for Fault Diagnosis of Bearing [5.3598912592106345]
Deep learning has led to significant advances in bearing fault diagnosis (FD).
We propose a novel FD model by integrating a multiscale quaternion convolutional neural network (MQCNN), a bidirectional gated recurrent unit (BiGRU), and cross self-attention feature fusion (CSAFF).
arXiv Detail & Related papers (2024-05-25T07:55:02Z) - E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection [21.185032466325737]
We introduce E2E-MFD, a novel end-to-end algorithm for multimodal fusion detection.<n>E2E-MFD streamlines the process, achieving high performance with a single training phase.<n>Our extensive testing on multiple public datasets reveals E2E-MFD's superior capabilities.
arXiv Detail & Related papers (2024-03-14T12:12:17Z) - Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z) - Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition [13.104967563769533]
We introduce a new multimodal fusion architecture, referred to as the Unified Contrastive Fusion Transformer (UCFFormer).
UCFFormer integrates data with diverse distributions to enhance human action recognition (HAR) performance.
We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer.
arXiv Detail & Related papers (2023-09-10T14:10:56Z) - MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
arXiv Detail & Related papers (2023-07-18T11:26:02Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)