Enhancing Video Transformers for Action Understanding with VLM-aided Training
- URL: http://arxiv.org/abs/2403.16128v1
- Date: Sun, 24 Mar 2024 12:55:50 GMT
- Title: Enhancing Video Transformers for Action Understanding with VLM-aided Training
- Authors: Hui Lu, Hu Jian, Ronald Poppe, Albert Ali Salah
- Abstract summary: We propose a framework that takes advantage of the complementary strengths of ViTs and VLMs.
The FTP framework adds processors that focus on specific aspects of human action in videos.
We achieve a top-1 accuracy of 93.8% on Kinetics-400 and 83.4% on Something-Something V2, surpassing VideoMAEv2 by 2.8% and 2.6%, respectively.
- Score: 10.02739652443895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to their ability to extract relevant spatio-temporal video embeddings, Vision Transformers (ViTs) are currently the best performing models in video action understanding. However, their generalization over domains or datasets is somewhat limited. In contrast, Visual Language Models (VLMs) have demonstrated exceptional generalization performance, but are currently unable to process videos. Consequently, they cannot extract spatio-temporal patterns that are crucial for action understanding. In this paper, we propose the Four-tiered Prompts (FTP) framework that takes advantage of the complementary strengths of ViTs and VLMs. We retain ViTs' strong spatio-temporal representation ability but improve the visual encodings to be more comprehensive and general by aligning them with VLM outputs. The FTP framework adds four feature processors that focus on specific aspects of human action in videos: action category, action components, action description, and context information. The VLMs are only employed during training, and inference incurs a minimal computation cost. Our approach consistently yields state-of-the-art performance. For instance, we achieve remarkable top-1 accuracy of 93.8% on Kinetics-400 and 83.4% on Something-Something V2, surpassing VideoMAEv2 by 2.8% and 2.6%, respectively.
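The abstract describes the training setup only at a high level; the sketch below shows one plausible reading of it, not the authors' code: a video ViT feature is passed through four lightweight feature processors, one per prompt tier (action category, action components, action description, context), and each output is aligned to a precomputed VLM embedding during training. Dimensions, the cosine alignment loss, and the concatenation-based classifier are illustrative assumptions.

```python
# Hedged sketch of an FTP-style training head (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProcessor(nn.Module):
    """Projects ViT video features into the space of one VLM prompt output."""
    def __init__(self, vit_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(vit_dim), nn.Linear(vit_dim, vlm_dim))

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(video_feat)

class FTPHead(nn.Module):
    def __init__(self, vit_dim: int = 768, vlm_dim: int = 512, num_classes: int = 400):
        super().__init__()
        # One processor per prompt tier: category, components, description, context.
        self.processors = nn.ModuleList([FeatureProcessor(vit_dim, vlm_dim) for _ in range(4)])
        self.classifier = nn.Linear(vit_dim + 4 * vlm_dim, num_classes)

    def forward(self, video_feat, vlm_targets=None):
        outs = [p(video_feat) for p in self.processors]      # four prompt-specific views
        logits = self.classifier(torch.cat([video_feat, *outs], dim=-1))
        align_loss = video_feat.new_zeros(())
        if vlm_targets is not None:                          # VLM outputs exist only at training time
            for out, tgt in zip(outs, vlm_targets):
                align_loss = align_loss + (1 - F.cosine_similarity(out, tgt, dim=-1)).mean()
        return logits, align_loss
```

At inference, `vlm_targets` is simply omitted, so only the ViT backbone and the small processors run, which is consistent with the abstract's claim that the VLMs are used only during training and add minimal inference cost.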
Related papers
- SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding [23.96372422130216]
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years.
They struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video details inquiries.
To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities.
arXiv Detail & Related papers (2025-04-10T13:40:34Z) - LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input.
Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets.
We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
arXiv Detail & Related papers (2025-04-09T12:51:10Z) - Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model [60.171601995737646]
Mobile-VideoGPT is an efficient multimodal framework for video understanding.
It consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM).
Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction [17.038321383586037]
Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently.
Current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language.
We propose the Video Visual Prompt Benchmark (V2P-Bench), a benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios.
arXiv Detail & Related papers (2025-03-22T11:30:46Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [23.188508465235717]
We propose two strategies to enhance the model's capability in video understanding tasks.
The first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE.
The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask.
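The Frame-wise Block Causal Attention Mask is only named in the summary; the sketch below shows one plausible reading, assumed here for illustration rather than taken from the paper: tokens within the same frame attend to each other freely, while attention across frames remains causal.

```python
# Minimal sketch (an assumption, not TC-LLaVA's exact construction) of a
# frame-wise block causal attention mask.
import torch

def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask of shape (N, N); True means the query may attend to the key."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame   # frame index of each token
    # A query in frame i may attend to keys in frames 0..i (inclusive).
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

mask = frame_block_causal_mask(num_frames=3, tokens_per_frame=2)
# Convert to an additive mask (0 / -inf) before passing it to a standard attention layer.
```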
arXiv Detail & Related papers (2024-09-05T02:54:17Z) - Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is highly inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
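As a rough illustration of the pruning idea summarized here, the sketch below ranks visual tokens by the attention they receive in an early layer and keeps only the top fraction for subsequent layers; the layer choice, keep ratio, and scoring rule are assumptions rather than FastV's exact procedure.

```python
# Illustrative attention-based pruning of visual tokens (assumed simplification).
import torch

def prune_visual_tokens(attn: torch.Tensor, visual_idx: torch.Tensor, keep_ratio: float = 0.5):
    """attn: (B, heads, N, N) attention weights from one layer;
    visual_idx: 1-D tensor with the positions of visual tokens.
    Returns, per batch element, the visual positions to keep."""
    recv = attn.mean(dim=1).mean(dim=1)          # (B, N): average attention each token receives
    vis_scores = recv[:, visual_idx]             # (B, V): scores of visual tokens only
    k = max(1, int(keep_ratio * visual_idx.numel()))
    top = vis_scores.topk(k, dim=-1).indices     # (B, k) indices into visual_idx
    return visual_idx[top]                       # (B, k) positions of kept visual tokens
```

Later layers would then run on the text tokens plus only these kept visual positions.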
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection based on vision transformers (ViTs).
Within a video clip, we keep the tokens of its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine the scene context by leveraging the remaining tokens to better recognize actor identities.
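The keyframe-centric token dropout can be pictured with the following sketch, in which all keyframe tokens are kept and only the highest-scoring tokens from the other frames survive; the motion-relevance scores are a placeholder assumption here, not the paper's scoring design.

```python
# Illustrative keyframe-centric token dropout (assumed simplification).
import torch

def keyframe_token_dropout(tokens, frame_ids, keyframe, scores, keep_per_frame=16):
    """tokens: (N, D); frame_ids: (N,) frame index per token;
    scores: (N,) motion-relevance score per token (placeholder)."""
    keep = frame_ids == keyframe                          # always keep keyframe tokens
    for f in frame_ids.unique():
        if int(f) == keyframe:
            continue
        idx = (frame_ids == f).nonzero(as_tuple=True)[0]  # tokens of this frame
        k = min(keep_per_frame, idx.numel())
        keep[idx[scores[idx].topk(k).indices]] = True     # keep the top-k motion tokens
    return tokens[keep], frame_ids[keep]
```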
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - Video Action Recognition with Attentive Semantic Units [23.384091957466588]
We exploit the semantic units hiding behind the action labels for more accurate action recognition.
We introduce a multi-region module (MRA) to the visual branch of the Visual-Language Models (VLMs).
In fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T03:44:15Z) - InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
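A minimal sketch of adding a motion target to a mask-and-predict objective, assuming for illustration that motion is approximated by frame differences and that both targets use an MSE loss on masked positions; the paper's actual targets and losses may differ.

```python
# Hedged sketch: joint appearance (pixels) and motion (frame-difference) targets.
import torch
import torch.nn.functional as F

def appearance_motion_loss(pred_app, pred_mot, clip, mask_app, mask_mot):
    """clip: (B, T, C, H, W) normalized video; masks are boolean tensors matching
    the corresponding prediction shapes (True = masked position to reconstruct)."""
    app_target = clip                         # appearance target: raw frames
    mot_target = clip[:, 1:] - clip[:, :-1]   # motion target: frame differences
    loss_app = F.mse_loss(pred_app[mask_app], app_target[mask_app])
    loss_mot = F.mse_loss(pred_mot[mask_mot], mot_target[mask_mot])
    return loss_app + loss_mot
```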
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [10.584604416749965]
We present a minimal Vision-and-Language Transformer (ViLT) model for vision-and-language downstream tasks.
ViLT is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs.
arXiv Detail & Related papers (2021-02-05T18:36:11Z)