PLAR: Prompt Learning for Action Recognition
- URL: http://arxiv.org/abs/2305.12437v2
- Date: Wed, 15 Nov 2023 02:59:32 GMT
- Title: PLAR: Prompt Learning for Action Recognition
- Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Dinesh Manocha
- Abstract summary: We present a new general learning approach, Prompt Learning for Action Recognition (PLAR)
Our approach is designed to predict the action label by helping the models focus on the descriptions or instructions associated with actions in the input videos.
We observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6% improvement on the ground camera single-agent dataset Something Something V2.
- Score: 56.57236976757388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new general learning approach, Prompt Learning for Action
Recognition (PLAR), which leverages the strengths of prompt learning to guide
the learning process. Our approach is designed to predict the action label by
helping the models focus on the descriptions or instructions associated with
actions in the input videos. Our formulation uses various prompts, including
learnable prompts, auxiliary visual information, and large vision models to
improve the recognition performance. In particular, we design a learnable
prompt method that learns to dynamically generate prompts from a pool of prompt
experts under different inputs. By sharing the same objective with the task,
our proposed PLAR can optimize prompts that guide the model's predictions while
explicitly learning input-invariant (prompt experts pool) and input-specific
(data-dependent) prompt knowledge. We evaluate our approach on datasets
consisting of both ground camera videos and aerial videos, and scenes with
single-agent and multi-agent actions. In practice, we observe a 3.17-10.2%
accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6%
improvement on the ground camera single-agent dataset Something Something V2.
We plan to release our code on the WWW.
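To make the prompt-experts idea concrete, below is a minimal, hypothetical PyTorch sketch of a learnable prompt pool with data-dependent gating. The class name, dimensions, gating mechanism (a linear layer with softmax), and the way the pooled (input-invariant) and gated (input-specific) prompt knowledge are combined are all assumptions for illustration; the paper's actual architecture may differ.
```python
import torch
import torch.nn as nn

class PromptExpertPool(nn.Module):
    """Illustrative sketch (not the authors' code): a pool of learnable prompt
    experts with data-dependent gating. The shared expert pool carries
    input-invariant prompt knowledge; the per-sample gate adds input-specific
    (data-dependent) knowledge."""

    def __init__(self, num_experts=8, prompt_len=4, dim=768):
        super().__init__()
        # Input-invariant prompt experts, shared across all inputs.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
        # Gate that scores each expert from a per-sample feature vector.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, feat):
        # feat: (batch, dim) pooled video feature, e.g. a CLS token.
        weights = self.gate(feat).softmax(dim=-1)          # (batch, num_experts)
        # Data-dependent mixture of experts -> input-specific prompts.
        prompts = torch.einsum("be,eld->bld", weights, self.experts)
        return prompts                                     # (batch, prompt_len, dim)

# Usage sketch: prepend the generated prompts to the backbone's token sequence
# and train the pool jointly with the action-recognition loss.
pool = PromptExpertPool()
video_feat = torch.randn(2, 768)      # placeholder pooled features
tokens = torch.randn(2, 16, 768)      # placeholder patch/frame tokens
prompted = torch.cat([pool(video_feat), tokens], dim=1)
print(prompted.shape)                 # torch.Size([2, 20, 768])
```
Because the pool and the gate would be optimized with the same task objective as the recognition head, the experts can capture prompt knowledge shared across inputs while the gate injects per-sample variation, mirroring the input-invariant/input-specific split described in the abstract.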
Related papers
- Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model [15.828023370166411]
We conduct a direct analysis of the multi-modal prompts by asking the following questions.
(i) How do the learned multi-modal prompts improve the recognition performance?
(ii) What do the multi-modal prompts learn?
arXiv Detail & Related papers (2023-12-18T04:49:03Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) aims to understand a new task from a few demonstrations (i.e., prompts) and make predictions on new inputs without tuning the model.
This paper shows that prompt selection and prompt fusion are two major factors that directly affect the inference performance of visual in-context learning.
We propose prompt-SelF, a simple framework for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z)
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Prompt-Learning for Fine-Grained Entity Typing [40.983849729537795]
We investigate the application of prompt-learning on fine-grained entity typing in fully supervised, few-shot and zero-shot scenarios.
We propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types.
arXiv Detail & Related papers (2021-08-24T09:39:35Z)
- ALICE: Active Learning with Contrastive Natural Language Explanations [69.03658685761538]
We propose Active Learning with Contrastive Explanations (ALICE) to improve data efficiency in learning.
ALICE learns to first use active learning to select the most informative pairs of label classes to elicit contrastive natural language explanations.
It then extracts knowledge from these explanations via semantic parsing.
arXiv Detail & Related papers (2020-09-22T01:02:07Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework Memory-augmented Predictive Coding (MemDPC) for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance relative to other approaches while using orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.