Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
- URL: http://arxiv.org/abs/2509.05695v1
- Date: Sat, 06 Sep 2025 12:11:43 GMT
- Title: Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
- Authors: Jingwei Peng, Zhixuan Qiu, Boyu Jin, Surasakdi Siripong
- Abstract summary: This paper introduces a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent "semantic action tokens." These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM for robust action classification and semantic reasoning.
- Score: 1.6799377888527687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinction, limitations that traditional methods frequently encounter when dealing with diverse video data. Inspired by the remarkable capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition, emphasizing enhanced accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent "semantic action tokens," effectively crafting an "action narrative" that is comprehensible to an LVLM. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for robust action classification and semantic reasoning. LVLM-VAR not only achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120, demonstrating significant improvements (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), but also substantially boosts model interpretability by generating natural language explanations for its predictions.
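The abstract above outlines a two-stage design: a VST module that discretizes a video into a short sequence of semantic action tokens, and a LoRA-fine-tuned LVLM that reads those tokens alongside a natural-language instruction. The minimal sketch below illustrates how such a pipeline could be wired together; the codebook-based quantization, tensor dimensions, token format, and prompt wording are illustrative assumptions rather than details taken from the paper, and the LoRA adaptation of the LVLM itself (e.g., via the PEFT library) is omitted.

```python
# Minimal sketch of the VST -> LVLM flow described in the abstract above.
# The codebook-based quantization, dimensions, and prompt format are
# assumptions for illustration; the LoRA-tuned LVLM call is stubbed out.
import torch
import torch.nn as nn


class VideoToSemanticTokens(nn.Module):
    """Hypothetical VST module: turns per-frame features into a short,
    temporally ordered sequence of discrete 'semantic action token' ids."""

    def __init__(self, feat_dim=512, codebook_size=256, num_tokens=16):
        super().__init__()
        self.temporal_encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)        # fixed-length "action narrative"
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, frame_feats):                          # (B, T, D) frame features
        h, _ = self.temporal_encoder(frame_feats)            # temporally contextualized features
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)     # (B, num_tokens, D)
        # Quantize each pooled feature to its nearest codebook entry (VQ-style).
        dists = (h.unsqueeze(2) - self.codebook.weight.view(1, 1, -1, h.size(-1))).pow(2).sum(-1)
        return dists.argmin(dim=-1)                          # (B, num_tokens) integer token ids


def build_prompt(token_ids, instruction="What action is being performed in this video?"):
    """Render the discrete action tokens as text so that a LoRA-fine-tuned
    LVLM (e.g., LLaVA-13B) could consume them together with an instruction."""
    narrative = " ".join(f"<act_{i}>" for i in token_ids.tolist())
    return f"Action narrative: {narrative}\n{instruction}"


if __name__ == "__main__":
    vst = VideoToSemanticTokens()
    frame_feats = torch.randn(1, 64, 512)   # stand-in for features of 64 video frames
    token_ids = vst(frame_feats)[0]
    print(build_prompt(token_ids))          # this prompt would be fed to the fine-tuned LVLM
```

In a full implementation, the LVLM would then be adapted with LoRA so that it maps such prompts to both an action label and a natural-language explanation, which is the behavior the abstract describes.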
Related papers
- Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning [11.901989132359676]
We introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual reinforcement learning (RL). Semore simultaneously extracts semantic and motion representations from the RGB stream through a dual-path backbone. Our method is more efficient and adaptive than state-of-the-art methods.
arXiv Detail & Related papers (2025-12-04T16:54:41Z)
- Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes [54.50887214639301]
We propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. In our zero-shot experiments, our model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively.
arXiv Detail & Related papers (2025-10-31T07:45:44Z)
- VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos [8.711160469571942]
This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VTEM transforms raw video into semantically rich and temporally coherent "visual event sequences." The framework consistently achieves state-of-the-art performance, surpassing existing methods.
arXiv Detail & Related papers (2025-08-21T18:03:16Z)
- Aligning Effective Tokens with Video Anomaly in Large Language Models [52.620554265703916]
We propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules. We construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs.
arXiv Detail & Related papers (2025-08-08T14:30:05Z)
- Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (MLLM). Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
- SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning [50.98341607245458]
Masked video modeling is an effective paradigm for video self-supervised learning (SSL). This paper introduces a novel SSL approach for video representation learning, dubbed SMILE, that infuses both spatial and motion semantics. We establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data.
arXiv Detail & Related papers (2025-04-01T08:20:55Z)
- Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics (CDE) and the Temporal Learner (TRL). We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new state-of-the-art performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
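The EMCL entry above attributes its compact representations to an Expectation-Maximization procedure that finds a small set of bases for the latent space. The sketch below is one plausible reading of that idea, alternating soft assignment of token features to bases (E-step) with re-estimation of the bases (M-step); the dimensions, temperature, and iteration count are arbitrary choices, not values from the paper.

```python
# Illustrative EM-style estimation of a compact basis set for embedding
# features, loosely following the idea described in the EMCL entry above.
# All hyperparameters here are arbitrary assumptions.
import torch


def em_compact_bases(features, num_bases=8, iters=4, temperature=0.05):
    """features: (N, D) token embeddings. Returns (bases, reconstructed),
    where bases is (K, D) and reconstructed is the low-rank re-expression
    of the features in terms of those bases."""
    n, _ = features.shape
    bases = features[torch.randperm(n)[:num_bases]].clone()        # initialize bases from the data
    for _ in range(iters):
        # E-step: soft responsibility of each basis for each feature.
        resp = (features @ bases.t() / temperature).softmax(dim=1)  # (N, K)
        # M-step: each basis becomes a responsibility-weighted mean of the features.
        bases = (resp.t() @ features) / (resp.sum(dim=0, keepdim=True).t() + 1e-6)
    # Re-express every feature inside the compact subspace spanned by the bases.
    coeffs = (features @ bases.t() / temperature).softmax(dim=1)
    reconstructed = coeffs @ bases                                   # (N, D)
    return bases, reconstructed


if __name__ == "__main__":
    feats = torch.nn.functional.normalize(torch.randn(32, 64), dim=1)  # toy embeddings
    bases, recon = em_compact_bases(feats)
    print(bases.shape, recon.shape)   # torch.Size([8, 64]) torch.Size([32, 64])
```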