Paxion: Patching Action Knowledge in Video-Language Foundation Models
- URL: http://arxiv.org/abs/2305.10683v4
- Date: Sat, 21 Oct 2023 16:34:03 GMT
- Title: Paxion: Patching Action Knowledge in Video-Language Foundation Models
- Authors: Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng
Tang, Mohit Bansal, Heng Ji
- Abstract summary: Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Despite their impressive performance on various benchmark tasks, recent video-language models show a surprising deficiency (near-random performance) in action knowledge.
We propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective.
- Score: 112.92853632161604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Action knowledge involves the understanding of textual, visual, and temporal
aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench)
containing two carefully designed probing tasks: Action Antonym and Video
Reversal, which target multimodal alignment capabilities and temporal
understanding skills of the model, respectively. Despite recent video-language
models' (VidLM) impressive performance on various benchmark tasks, our
diagnostic tasks reveal their surprising deficiency (near-random performance)
in action knowledge, suggesting that current models rely on object recognition
abilities as a shortcut for action understanding. To remedy this, we propose a
novel framework, Paxion, along with a new Discriminative Video Dynamics
Modeling (DVDM) objective. The Paxion framework utilizes a Knowledge Patcher
network to encode new action knowledge and a Knowledge Fuser component to
integrate the Patcher into frozen VidLMs without compromising their existing
capabilities. Due to limitations of the widely-used Video-Text Contrastive
(VTC) loss for learning action knowledge, we introduce the DVDM objective to
train the Knowledge Patcher. DVDM forces the model to encode the correlation
between the action text and the correct ordering of video frames. Our extensive
analyses show that Paxion and DVDM together effectively fill the gap in action
knowledge understanding (~50% to 80%), while maintaining or improving
performance on a wide spectrum of both object- and action-centric downstream
tasks. The code and data will be made publicly available for research purposes
at https://github.com/MikeWangWZHL/Paxion.git.
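To make the DVDM idea more concrete, below is a minimal sketch, not the authors' implementation, of how a discriminative objective can force the true (video, action text) pair to outscore the two ActionBench-style negatives: a temporally reversed clip (Video Reversal) and an antonym caption (Action Antonym). In Paxion this would operate on features produced by the Knowledge Patcher on top of a frozen VidLM; here random tensors stand in for those features, and the pooling, margin, and dimensions are illustrative assumptions.

```python
# Sketch of a DVDM-style discriminative objective (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def clip_embedding(frame_feats: torch.Tensor) -> torch.Tensor:
    """Pool per-frame features [T, D] into a single normalized clip embedding [D]."""
    return F.normalize(frame_feats.mean(dim=0), dim=-1)

def dvdm_style_loss(frame_feats, text_emb, antonym_text_emb, margin=0.2):
    """Margin loss: the true pair must outscore reversed-video and antonym-text pairs."""
    video = clip_embedding(frame_feats)
    video_rev = clip_embedding(torch.flip(frame_feats, dims=[0]))  # reversed frame order
    text = F.normalize(text_emb, dim=-1)
    antonym = F.normalize(antonym_text_emb, dim=-1)

    pos = video @ text              # e.g. (clip, "opening a door")
    neg_reversal = video_rev @ text # reversed frames, same caption
    neg_antonym = video @ antonym   # same frames, antonym caption ("closing a door")

    return (F.relu(margin - pos + neg_reversal) +
            F.relu(margin - pos + neg_antonym))

# Toy usage with random features standing in for frozen-VidLM / Knowledge Patcher outputs.
T, D = 8, 256
frames = torch.randn(T, D)
caption, antonym = torch.randn(D), torch.randn(D)
print(dvdm_style_loss(frames, caption, antonym).item())
```

The same pair construction doubles as an ActionBench-style probe: scoring whether the model prefers the true caption over its antonym, and the forward clip over its reversal, measures exactly the action knowledge the benchmark targets.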
Related papers
- Language Model Guided Interpretable Video Action Reasoning [32.999621421295416]
We present a new framework named Language-guided Interpretable Action Recognition (LaIAR).
LaIAR leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models.
In essence, we redefine the problem of understanding video model decisions as a task of aligning video and language models.
arXiv Detail & Related papers (2024-04-02T02:31:13Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- DEVIAS: Learning Disentangled Video Representations of Action and Scene [3.336126457178601]
Video recognition models often learn scene-biased action representation due to the spurious correlation between actions and scenes in the training data.
We propose a disentangling encoder-decoder architecture to learn disentangled action and scene representations with a single model.
We rigorously validate the proposed method on the UCF-101, Kinetics-400, and HVU datasets for seen action-scene combinations, and on the SCUBA, HAT, and HVU datasets for unseen action-scene combinations.
arXiv Detail & Related papers (2023-11-30T18:58:44Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Rich Action-semantic Consistent Knowledge for Early Action Prediction [20.866206453146898]
Early action prediction (EAP) aims to recognize human actions from a part of action execution in ongoing videos.
We partition original partial or full videos to form a new series of partial videos evolving in arbitrary progress levels.
A novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework is proposed for EAP.
arXiv Detail & Related papers (2022-01-23T03:39:31Z)
- ActionCLIP: A New Paradigm for Video Action Recognition [14.961103794667341]
We provide a new perspective on action recognition by attaching importance to the semantic information of label texts.
We propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune".
arXiv Detail & Related papers (2021-09-17T11:21:34Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both standard video dataset (Kinetics-210k) and uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
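For reference, below is a minimal sketch of the standard video-text contrastive objective that CPD-style pair discrimination and the Video-Text Contrastive (VTC) loss discussed in the Paxion abstract both build on. It is an illustrative symmetric InfoNCE implementation, not code from either paper; the temperature value and batch construction are common choices assumed here.

```python
# Sketch of a video-text contrastive (VTC / pair-discrimination style) loss.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings [B, D]."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # [B, B] similarity matrix
    targets = torch.arange(v.size(0))     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for encoder outputs.
B, D = 4, 256
print(video_text_contrastive_loss(torch.randn(B, D), torch.randn(B, D)).item())
```

Because such a loss only needs to match a clip with its caption within a batch, it can often be minimized using object cues alone, which is precisely the shortcut behavior the Paxion abstract diagnoses and the DVDM objective is designed to counteract.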