WDMIR: Wavelet-Driven Multimodal Intent Recognition
- URL: http://arxiv.org/abs/2506.10011v1
- Date: Tue, 27 May 2025 03:32:45 GMT
- Title: WDMIR: Wavelet-Driven Multimodal Intent Recognition
- Authors: Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu
- Abstract summary: This paper presents a novel Wavelet-Driven Multimodal Intent Recognition framework. It enhances intent understanding through frequency-domain analysis of non-verbal information. Our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy.
- Score: 11.292250176088276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio, and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. Specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, yielding a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
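To make the frequency-domain fusion idea concrete, here is a minimal sketch of synchronized wavelet decomposition and sub-band fusion of video and audio feature sequences. It assumes both modalities have already been projected to a common feature dimension and temporally aligned; the simple averaging of sub-bands stands in for the learned fusion the paper describes, and all names (`wavelet_fuse`, `video_feats`, etc.) are illustrative rather than taken from the authors' code.

```python
# A minimal sketch, not the authors' implementation: single-level 1D DWT
# along time, per-sub-band mixing, and inverse DWT back to a fused sequence.
import numpy as np
import pywt  # PyWavelets

def wavelet_fuse(video_feats: np.ndarray, audio_feats: np.ndarray,
                 wavelet: str = "haar") -> np.ndarray:
    """Fuse two (T, D) feature sequences in the wavelet domain."""
    # Synchronized decomposition along the temporal axis (axis 0):
    # cA holds low-frequency trends, cD holds high-frequency transients.
    vA, vD = pywt.dwt(video_feats, wavelet, axis=0)
    aA, aD = pywt.dwt(audio_feats, wavelet, axis=0)
    # Sub-band-wise integration; the paper learns this mixing, we average.
    fA = 0.5 * (vA + aA)   # slow dynamics, e.g. sustained prosody or pose
    fD = 0.5 * (vD + aD)   # fast dynamics, e.g. micro-expressions
    # Reconstruct a time-domain fused sequence of the original length.
    return pywt.idwt(fA, fD, wavelet, axis=0)

# Toy usage: 32 aligned frames with 256-d features per modality.
rng = np.random.default_rng(0)
video = rng.standard_normal((32, 256)).astype(np.float32)
audio = rng.standard_normal((32, 256)).astype(np.float32)
fused = wavelet_fuse(video, audio)
print(fused.shape)  # (32, 256)
```

In the trimodal stage the abstract describes, this bimodal output would serve as the non-verbal stream that is progressively fused with text features, e.g. via cross-attention.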
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? [12.662031101992968]
We investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels. Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
arXiv Detail & Related papers (2024-09-13T22:18:45Z)
- Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition [8.261744063074612]
We propose a Detail-Enhanced Intra- and Inter-modal Interaction network (DE-III) for Audio-Visual Emotion Recognition (AVER).
We introduce optical flow information to enrich video representations with texture details that better capture facial state changes.
A fusion module integrates the optical flow estimation with the corresponding video frames to enhance the representation of facial texture variations; a generic flow-extraction sketch follows this entry.
arXiv Detail & Related papers (2024-05-26T21:31:59Z)
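The optical-flow enrichment above can be approximated with a standard dense-flow recipe. The sketch below uses OpenCV's Farneback estimator and simply appends the two flow channels to each RGB frame; it is a generic illustration, not the DE-III authors' code, and `add_flow_channels` is a hypothetical helper.

```python
# A hedged sketch of optical-flow enrichment: dense Farneback flow between
# consecutive frames is appended to each frame as two extra channels.
import cv2
import numpy as np

def add_flow_channels(frames):
    """frames: list of HxWx3 uint8 BGR images -> list of HxWx5 float32."""
    enriched = []
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow: an HxWx2 field of per-pixel (dx, dy) motion.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        # Stack normalized color channels with the flow field -> 5 channels.
        enriched.append(np.dstack([frame.astype(np.float32) / 255.0, flow]))
        prev_gray = gray
    return enriched
```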
- Leveraging Speech for Gesture Detection in Multimodal Communication [3.798147784987455]
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system.
Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech.
This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures.
arXiv Detail & Related papers (2024-04-23T11:54:05Z)
- AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts [8.809586885539002]
We propose a novel approach utilizing audio-visual multimodal data.
This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network; a minimal sketch of this front end follows this entry.
Our method notably improves the accuracy of AU detection by modeling the temporal and contextual nuances of the data, showing significant advances in handling intricate scenarios.
arXiv Detail & Related papers (2024-03-20T15:37:19Z)
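For the MFCC and log-Mel front end described in the AUD-TGN entry above, a minimal librosa sketch could look like the following; the file name and parameter values are placeholders, and the VGGish embedding step is omitted.

```python
# A minimal sketch of an MFCC + log-Mel audio front end with librosa.
# "clip.wav" and the parameter values are placeholders, not from the paper.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)           # mono waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                   # (64, n_frames)
# Frame-level feature matrix; a VGGish network would embed log_mel further.
audio_feats = np.concatenate([mfcc, log_mel], axis=0)
print(audio_feats.shape)                             # (77, n_frames)
```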
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
- Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
We present a novel framework that accomplishes compositional activity recognition by interactive fusion.
We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction.
We evaluate our approach on two action recognition datasets, Something-Something and Charades.
arXiv Detail & Related papers (2020-12-10T14:17:18Z)