Dynamic Fusion Multimodal Network for SpeechWellness Detection
- URL: http://arxiv.org/abs/2508.18057v2
- Date: Mon, 01 Sep 2025 11:20:37 GMT
- Title: Dynamic Fusion Multimodal Network for SpeechWellness Detection
- Authors: Wenqiang Sun, Han Yin, Jisheng Bai, Jianfeng Chen
- Abstract summary: Suicide is one of the leading causes of death among adolescents. Previous suicide risk prediction studies have primarily focused on either textual or acoustic information in isolation. We explore a lightweight multi-branch multimodal system based on a dynamic fusion mechanism for speechwellness detection.
- Score: 7.169178956727836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Suicide is one of the leading causes of death among adolescents. Previous suicide risk prediction studies have primarily focused on either textual or acoustic information in isolation; however, the integration of multimodal signals, such as speech and text, offers a more comprehensive understanding of an individual's mental state. Motivated by this, and in the context of the 1st SpeechWellness detection challenge, we explore a lightweight multi-branch multimodal system based on a dynamic fusion mechanism for speechwellness detection. To address the limitation of prior approaches that rely on time-domain waveforms for acoustic analysis, our system incorporates both time-domain and time-frequency (TF) domain acoustic features, as well as semantic representations. In addition, we introduce a dynamic fusion block to adaptively integrate information from different modalities. Specifically, it applies learnable weights to each modality during the fusion process, enabling the model to adjust the contribution of each modality. To enhance computational efficiency, we design a lightweight structure by simplifying the original baseline model. Experimental results demonstrate that the proposed system exhibits superior performance compared to the challenge baseline, achieving a 78% reduction in model parameters and a 5% improvement in accuracy.
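The abstract describes the dynamic fusion block as applying learnable weights to each modality branch before combining them. Below is a minimal PyTorch sketch of that idea, assuming three branch embeddings (time-domain acoustic, time-frequency acoustic, and text semantics) already projected to a common dimension; the module name, dimensions, and softmax weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a dynamic fusion block with learnable per-modality
# weights (illustrative; not the paper's official code).
import torch
import torch.nn as nn


class DynamicFusionBlock(nn.Module):
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One learnable logit per modality; softmax keeps the fusion
        # weights positive and summing to one.
        self.logits = nn.Parameter(torch.zeros(num_modalities))
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) embeddings, one per modality.
        weights = torch.softmax(self.logits, dim=0)             # (M,)
        stacked = torch.stack(feats, dim=1)                     # (B, M, D)
        fused = (weights.view(1, -1, 1) * stacked).sum(dim=1)   # (B, D)
        return self.proj(fused)


if __name__ == "__main__":
    block = DynamicFusionBlock(dim=128)
    time_feat, tf_feat, text_feat = (torch.randn(4, 128) for _ in range(3))
    out = block([time_feat, tf_feat, text_feat])
    print(out.shape)  # torch.Size([4, 128])
```

Because the weights are ordinary parameters, they are trained jointly with the branches and can be read off after training to inspect how much each modality contributes.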
Related papers
- Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models [6.350443894942629]
Multimodal Weight Allocation Module (MWAM) is a plug-and-play component that dynamically re-balances the contribution of each branch during training. MWAM delivers consistent performance gains across a wide range of tasks and modality combinations.
arXiv Detail & Related papers (2026-02-26T05:51:41Z)
- Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening [0.7136933021609076]
This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness.
arXiv Detail & Related papers (2026-01-17T03:35:39Z)
- Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection [54.10252086842123]
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. This paper proposes a modality optimization and dynamic primary modality selection framework (MODS). Experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-11-09T11:13:32Z)
- Multi-Modal Sentiment Analysis with Dynamic Attention Fusion [0.0]
We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder. Our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction.
arXiv Detail & Related papers (2025-09-25T09:54:04Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses fusion bottlenecks for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences arising from its use.