Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment
- URL: http://arxiv.org/abs/2507.21945v1
- Date: Tue, 29 Jul 2025 15:58:39 GMT
- Title: Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment
- Authors: Xin Wang, Peng-Jie Li, Yuan-Yuan Shen
- Abstract summary: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. Long-term Multimodal Attention Consistency Network (LMAC-Net) introduces a multimodal attention consistency mechanism to explicitly align multimodal features. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods.
- Score: 5.262258418692889
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended sequences. To address these challenges, we propose the Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention consistency mechanism to explicitly align multimodal features, enabling stable integration of visual and audio information and enhancing feature representations. Specifically, we introduce a multimodal local query encoder module to capture temporal semantics and cross-modal relations, and use a two-level score evaluation for interpretable results. In addition, attention-based and regression-based losses are applied to jointly optimize multimodal alignment and score fusion. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods, validating the effectiveness of our proposed approach.
Related papers
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [50.438552588818]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models [0.0]
Modality-Aware Adaptive Fusion Scheduling (MA-AFS) learns to dynamically modulate the contribution of each modality on a per-instance basis. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
arXiv Detail & Related papers (2025-06-15T05:57:45Z) - Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
We introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance.
arXiv Detail & Related papers (2025-05-25T07:26:00Z) - InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals [9.648001493025204]
InfoMAE is a cross-modal alignment framework that tackles the challenge of multimodal pair efficiency under the SSL setting. It enhances downstream multimodal tasks by over 60% with significantly improved multimodal pairing efficiency. It also improves unimodal task accuracy by an average of 22%.
arXiv Detail & Related papers (2025-04-13T20:03:29Z) - Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression [25.657978409890973]
Action Quality Assessment (AQA) aims at automatic and fair evaluation of athletic performance. Current methods focus on segmenting video into fixed frames, which disrupts the temporal continuity of sub-actions. We propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression.
arXiv Detail & Related papers (2025-01-07T10:20:16Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.