Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
        - URL: http://arxiv.org/abs/2507.03531v1
 - Date: Fri, 04 Jul 2025 12:35:52 GMT
 - Title: Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
 - Authors: Namho Kim, Junhwa Kim
 - Abstract summary: We propose a framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
 - Score: 0.0
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.
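To make the architecture described in the abstract concrete, below is a minimal sketch of the fusion idea: one GRU sequence encoder per modality, with the video stream attending to the image and text streams via cross-modal attention before a shared task head. The layer sizes, pooling choice, attention directions, and classifier head are assumptions made for illustration; they are not taken from the paper.

```python
# Minimal sketch (PyTorch) of GRU sequence encoders fused with cross-modal
# attention. All hyperparameters and the fusion/pooling choices below are
# illustrative assumptions, not the authors' reported configuration.
import torch
import torch.nn as nn

class CrossAttentiveGRUFusion(nn.Module):
    def __init__(self, video_dim, image_dim, text_dim, hidden=256, heads=4, num_classes=2):
        super().__init__()
        # One GRU sequence encoder per modality (inputs are [B, T, D]).
        self.video_gru = nn.GRU(video_dim, hidden, batch_first=True)
        self.image_gru = nn.GRU(image_dim, hidden, batch_first=True)
        self.text_gru = nn.GRU(text_dim, hidden, batch_first=True)
        # Cross-modal attention: video features attend to image and text features.
        self.video_to_image = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Classification head on the fused representation; per the abstract, a
        # regression task would instead use a single output and an MSE-style loss.
        self.head = nn.Linear(hidden * 3, num_classes)

    def forward(self, video, image, text):
        v, _ = self.video_gru(video)   # [B, Tv, H]
        i, _ = self.image_gru(image)   # [B, Ti, H]
        t, _ = self.text_gru(text)     # [B, Tt, H]
        v2i, _ = self.video_to_image(query=v, key=i, value=i)
        v2t, _ = self.video_to_text(query=v, key=t, value=t)
        # Temporal mean-pooling and concatenation as a simple fusion choice.
        fused = torch.cat([v.mean(1), v2i.mean(1), v2t.mean(1)], dim=-1)
        return self.head(fused)

# Example usage with dummy per-frame / per-token feature sequences.
model = CrossAttentiveGRUFusion(video_dim=512, image_dim=512, text_dim=300)
logits = model(torch.randn(2, 16, 512), torch.randn(2, 8, 512), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 2])
```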
 
       
      
        Related papers
        - Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment [5.262258418692889]
Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. The Long-term Multimodal Attention Consistency Network (LMAC-Net) introduces a multimodal attention consistency mechanism to explicitly align multimodal features. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods.
arXiv  Detail & Related papers  (2025-07-29T15:58:39Z) - Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
We introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance.
arXiv  Detail & Related papers  (2025-05-25T07:26:00Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv  Detail & Related papers  (2025-05-21T12:29:40Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv  Detail & Related papers  (2024-05-24T15:56:40Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv  Detail & Related papers  (2023-09-22T06:55:41Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv  Detail & Related papers  (2023-08-06T09:15:14Z) - Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv  Detail & Related papers  (2023-06-26T06:45:16Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv  Detail & Related papers  (2022-04-18T14:53:33Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event.
We propose a dual-branch network that takes as input proposals at multiple granularities in both the spatial and temporal domains.
arXiv  Detail & Related papers  (2021-08-09T06:11:14Z) - Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of the video, and an attention-based network to output normalized importance scores.
The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches.
arXiv  Detail & Related papers  (2021-05-24T07:24:39Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     