Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
- URL: http://arxiv.org/abs/2507.03531v1
- Date: Fri, 04 Jul 2025 12:35:52 GMT
- Title: Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
- Authors: Namho Kim, Junhwa Kim
- Abstract summary: We propose a framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained using a combination of classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.
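The core fusion step the abstract describes can be illustrated with a minimal NumPy sketch of cross-modal attention, where features from one modality query another. This is an illustrative sketch only, not the authors' implementation: the GRU encoders, the classification/regression losses, and the augmentation/autoencoding regularizers are omitted, and all names, shapes, and dimensions here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values, d_k):
    """Scaled dot-product attention: one modality attends to another.

    queries:     (T_q, d) features of the querying modality (e.g. video frames)
    keys_values: (T_k, d) features of the attended modality (e.g. text tokens)
    Returns (T_q, d) fused features aligned to the query timeline.
    """
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (T_q, T_k) similarity
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ keys_values                      # weighted sum of the other modality

# Illustrative stand-ins for GRU encoder outputs (random, fixed seed).
rng = np.random.default_rng(0)
d = 64
video_feats = rng.standard_normal((16, d))  # e.g. 16 frame-level states
text_feats = rng.standard_normal((8, d))    # e.g. 8 token-level states

fused = cross_modal_attention(video_feats, text_feats, d_k=d)
assert fused.shape == (16, d)
```

In a full model of this kind, `fused` would typically be concatenated with the unimodal features and passed to a task head; the attention step is what lets each video timestep pull in the most relevant text context.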
Related papers
- Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment [5.262258418692889]
Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. The Long-term Multimodal Attention Consistency Network (LMAC-Net) introduces a multimodal attention consistency mechanism to explicitly align multimodal features. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods.
arXiv Detail & Related papers (2025-07-29T15:58:39Z) - Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
We introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance.
arXiv Detail & Related papers (2025-05-25T07:26:00Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Learning Prompt-Enhanced Context Features for Weakly-Supervised Video
Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for
Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance
Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network that takes as input proposals with multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z) - Unsupervised Video Summarization with a Convolutional Attentive
Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of a video, and an attention-based network to output normalized importance scores.
The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2021-05-24T07:24:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.