Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
- URL: http://arxiv.org/abs/2508.17442v1
- Date: Sun, 24 Aug 2025 16:45:21 GMT
- Title: Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
- Authors: Liyang Peng, Sihan Zhu, Yunjie Guo
- Abstract summary: This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that bridges the gap between low-level visual features and high-level semantic information. Experiments on ActivityNet v1.3 and THUMOS14 demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder's learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model's ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.
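The adaptive gating mentioned in the abstract can be pictured with a short sketch. The code below is a minimal illustration, assuming a learned sigmoid gate over concatenated video and projected text features; ECVT's actual layer design is not specified here, and all module names and dimensions are illustrative.

```python
# A minimal sketch of adaptive-gating fusion: a global LVLM event embedding
# modulates clip-level video features. Names and sizes are assumptions,
# not ECVT's published configuration.
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Fuse a global text (event) embedding into per-clip video features
    via a learned sigmoid gate."""
    def __init__(self, video_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, video_dim)
        # Gate conditioned on both modalities decides how much semantic
        # information to inject at each temporal position.
        self.gate = nn.Sequential(
            nn.Linear(2 * video_dim, video_dim),
            nn.Sigmoid(),
        )

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, T, video_dim) clip features; text: (B, text_dim) global prompt.
        t = self.text_proj(text).unsqueeze(1).expand_as(video)  # (B, T, video_dim)
        g = self.gate(torch.cat([video, t], dim=-1))            # (B, T, video_dim)
        return video + g * t  # gated residual injection of semantics

fusion = AdaptiveGatedFusion()
clips = torch.randn(2, 16, 768)   # batch of 2 videos, 16 clips each
prompt = torch.randn(2, 512)      # a hypothetical "Global Event Prompting" embedding
print(fusion(clips, prompt).shape)  # torch.Size([2, 16, 768])
```

The gated residual lets the model suppress the text signal on clips where the global event description is uninformative, which is one plausible reading of "adaptive gating for high-level semantic fusion".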
Related papers
- Kwai Keye-VL 1.5 Technical Report [91.07838286692815]
We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment.
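The Slow-Fast allocation idea can be made concrete with a small routing sketch. Below is a minimal illustration, assuming pooled per-frame features and a cosine-similarity threshold; Keye-VL-1.5's actual criterion and token budgets are not given in this summary.

```python
# A rough sketch of dynamic slow/fast frame routing driven by inter-frame
# similarity. The threshold and cosine criterion are assumptions.
import torch
import torch.nn.functional as F

def route_frames(frames: torch.Tensor, threshold: float = 0.95):
    """Split a frame sequence into a 'slow' set (visually novel frames,
    given a full token budget) and a 'fast' set (near-duplicate frames
    that can be encoded cheaply).

    frames: (T, D) pooled per-frame features.
    Returns index tensors (slow_idx, fast_idx).
    """
    sims = F.cosine_similarity(frames[1:], frames[:-1], dim=-1)  # (T-1,)
    novel = torch.cat([torch.tensor([True]), sims < threshold])  # frame 0 is always novel
    slow_idx = torch.nonzero(novel, as_tuple=False).squeeze(-1)
    fast_idx = torch.nonzero(~novel, as_tuple=False).squeeze(-1)
    return slow_idx, fast_idx

feats = torch.randn(8, 256)
feats[3] = feats[2]  # simulate a static segment
slow, fast = route_frames(feats)
print(slow.tolist(), fast.tolist())
```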
arXiv Detail & Related papers (2025-09-01T15:46:58Z) - VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos [8.711160469571942]
This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VTEM transforms raw video into semantically rich and temporally coherent "visual event sequences". The framework consistently achieves state-of-the-art performance, surpassing existing methods.
arXiv Detail & Related papers (2025-08-21T18:03:16Z) - LET-US: Long Event-Text Understanding of Scenes [23.376693904132786]
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution. We introduce LET-US, a framework for long event-stream–text comprehension. We use an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details.
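As an illustration of adaptive event compression, the sketch below subsamples events per time window in proportion to activity. The windowing and budget heuristic are assumptions for illustration; LET-US's actual mechanism may differ.

```python
# Illustrative event-stream compression: events are binned into fixed time
# windows, and dense (high-activity) windows keep more events than sparse
# ones, up to a cap that bounds the downstream token count.
import numpy as np

def compress_events(t: np.ndarray, xy: np.ndarray, window_us: int = 10_000,
                    base_keep: int = 64, max_keep: int = 512) -> np.ndarray:
    """Return indices of events to keep. t: (N,) microsecond timestamps,
    xy: (N, 2) pixel coordinates (unused by this simple heuristic)."""
    keep = []
    bins = (t - t.min()) // window_us
    for b in np.unique(bins):
        idx = np.nonzero(bins == b)[0]
        # Budget grows with activity, capped to bound the event count.
        budget = min(max_keep, max(base_keep, len(idx) // 4))
        if len(idx) > budget:  # uniform subsample within the window
            idx = idx[np.linspace(0, len(idx) - 1, budget, dtype=int)]
        keep.append(idx)
    return np.concatenate(keep)

ts = np.sort(np.random.randint(0, 100_000, size=5000))
coords = np.random.randint(0, 256, size=(5000, 2))
kept = compress_events(ts, coords)
print(f"kept {len(kept)} of {len(ts)} events")
```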
arXiv Detail & Related papers (2025-08-10T16:02:41Z) - APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval [41.81696346270799]
Current large language models (LLMs) struggle with hour-level video understanding. We introduce Adaptive Pivot Visual Information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information.
arXiv Detail & Related papers (2025-06-05T12:27:10Z) - Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [47.40091830500585]
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. A framework, video-level language-driven VVI-ReID (VLD), consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal aggregation.
arXiv Detail & Related papers (2025-06-03T04:49:08Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
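A dedicated temporal encoder of this kind might look like the following sketch: temporal self-attention followed by pooling that shrinks the token count handed to the LLM. Layer sizes and the pooling stride are illustrative assumptions, not STORM's published configuration.

```python
# A minimal sketch of a temporal encoder slotted between an image encoder
# and a video LLM: self-attention over time, then temporal average-pooling
# to reduce the number of visual tokens.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, dim: int = 768, stride: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) per-frame features from a frozen image encoder.
        x = self.temporal(x)                # mix information across time
        # Average-pool every `stride` frames: stride-times fewer LLM tokens.
        B, T, D = x.shape
        x = x[:, : T - T % self.stride]     # trim to a multiple of stride
        return x.reshape(B, -1, self.stride, D).mean(dim=2)

enc = TemporalEncoder()
frames = torch.randn(1, 64, 768)
print(enc(frames).shape)  # torch.Size([1, 16, 768])
```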
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
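A minimal sketch of encoder-side appearance/motion merging follows, assuming a concatenate-and-project fusion at one feature level; MTNet's exact fusion operator is not described in this summary.

```python
# A small sketch of merging appearance (RGB) and motion (e.g. optical-flow)
# feature maps inside the encoder. The 1x1-conv fusion is an assumption.
import torch
import torch.nn as nn

class AppearanceMotionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),  # project the concatenation back
            nn.ReLU(inplace=True),
        )

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # appearance, motion: (B, dim, H, W) feature maps at one encoder level.
        return self.fuse(torch.cat([appearance, motion], dim=1))

fusion = AppearanceMotionFusion()
rgb = torch.randn(2, 256, 28, 28)
flow = torch.randn(2, 256, 28, 28)
print(fusion(rgb, flow).shape)  # torch.Size([2, 256, 28, 28])
```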
arXiv Detail & Related papers (2025-01-14T03:15:46Z) - Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL). We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
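The causal-temporal idea can be sketched as masked temporal attention, where each frame attends only to its past. This is an assumption about the general flavor of such a module, not the paper's actual CDE/TRL design.

```python
# A hedged sketch of causal-temporal modeling: frame features attend only
# to earlier frames before being handed to the LVLM.
import torch
import torch.nn as nn

class CausalTemporalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim). The upper-triangular mask forbids attending to the
        # future, biasing the model toward cause-precedes-effect structure.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)

block = CausalTemporalBlock()
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```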
arXiv Detail & Related papers (2024-12-14T07:28:38Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote the learning of region-object alignment and temporal-aware features (a sketch of the alignment objective follows below).
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
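Region-object alignment of the kind S-ViLM promotes is often trained with a contrastive objective; the symmetric InfoNCE sketch below is a common formulation, assumed here rather than taken from the paper.

```python
# An illustrative region-object alignment loss: region features from a clip
# are matched to object-word embeddings from the caption via InfoNCE.
import torch
import torch.nn.functional as F

def region_object_contrastive(regions: torch.Tensor, objects: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """regions: (N, D) region features; objects: (N, D) paired object-word
    embeddings (row i of each tensor forms a positive pair)."""
    r = F.normalize(regions, dim=-1)
    o = F.normalize(objects, dim=-1)
    logits = r @ o.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(len(r))              # diagonal entries are positives
    # Symmetric InfoNCE: align regions to objects and objects to regions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = region_object_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```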
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
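Language-guided dynamic filtering can be sketched as predicting per-sample convolution kernels from a sentence embedding, as below; the depthwise grouped-conv realization and kernel size are illustrative assumptions, not LMDF's exact formulation.

```python
# A compact sketch of language-guided dynamic filtering: a language
# embedding predicts per-sample, per-channel kernels that are applied to
# the visual feature map via a grouped convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedFilter(nn.Module):
    def __init__(self, dim: int = 256, text_dim: int = 512, k: int = 3):
        super().__init__()
        self.k = k
        # Predict one k x k depthwise kernel per channel from the sentence embedding.
        self.kernel_gen = nn.Linear(text_dim, dim * k * k)

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; text: (B, text_dim) sentence embedding.
        B, C, H, W = feat.shape
        kernels = self.kernel_gen(text).view(B * C, 1, self.k, self.k)
        # Grouped conv with groups=B*C applies each sample's own kernels.
        out = F.conv2d(feat.reshape(1, B * C, H, W), kernels,
                       padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)

filt = LanguageGuidedFilter()
print(filt(torch.randn(2, 256, 14, 14), torch.randn(2, 512)).shape)
```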
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.