Distilling Knowledge from Language Models for Video-based Action
Anticipation
- URL: http://arxiv.org/abs/2210.05991v1
- Date: Wed, 12 Oct 2022 08:02:11 GMT
- Title: Distilling Knowledge from Language Models for Video-based Action
Anticipation
- Authors: Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian
- Abstract summary: Anticipating future actions in a video is useful for many autonomous and assistive technologies.
We propose a method that makes use of the text modality, available during training, to bring in complementary information that is not present in the target action anticipation datasets.
- Score: 31.59130630384036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Anticipating future actions in a video is useful for many autonomous and
assistive technologies. Prior action anticipation work mostly treats this as a
vision modality problem, where the models learn the task information primarily
from the video features in the target action anticipation datasets. In this
work, we propose a method that makes use of the text modality, which is available
during training, to bring in complementary information that is not present
in the target action anticipation datasets. In particular, we leverage
pre-trained language models to build a text-modality teacher that is able to
predict future actions based on text labels of the past actions extracted from
the input video. To further adapt the teacher to the target domain (cooking),
we also pretrain the teacher on textual instructions from a recipe dataset
(Recipe1M). Then, we distill the knowledge gained by the text-modality teacher
into a vision-modality student to further improve its performance. We
empirically evaluate this simple cross-modal distillation strategy on two video
datasets, EGTEA-GAZE+ and EPIC-KITCHENS-55. Distilling this text-modality
knowledge into a strong vision model (Anticipative Vision Transformer) yields
consistent gains across both datasets: a 3.5% relative improvement in top-1
class-mean recall on EGTEA-GAZE+ and a 7.2% relative improvement in top-5
many-shot class-mean recall on EPIC-KITCHENS-55, achieving new
state-of-the-art results.
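To make the distillation step concrete, the sketch below shows a generic teacher-student objective of the kind described in the abstract: a vision student is trained on ground-truth future-action labels while also matching the softened predictions of a frozen text-modality teacher. This is a minimal illustration assuming standard Hinton-style distillation in PyTorch; the temperature, loss weight, and function names are assumptions, not the paper's exact implementation.

```python
# Minimal cross-modal distillation sketch (assumed setup, not the authors' code).
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits: torch.Tensor,   # vision-model outputs
                                  teacher_logits: torch.Tensor,   # frozen text-teacher outputs
                                  targets: torch.Tensor,          # ground-truth future action classes
                                  temperature: float = 2.0,       # assumed softening temperature
                                  alpha: float = 0.5) -> torch.Tensor:  # assumed loss weight
    # Supervised loss on the anticipated action class.
    ce = F.cross_entropy(student_logits, targets)
    # KL term pulling the student toward the teacher's softened distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kl
```

In this reading, the teacher's logits come from a language model that consumes the text labels of past actions, while the student's logits come from the video model; only the student receives gradients.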
Related papers
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
- On the Efficacy of Text-Based Input Modalities for Action Anticipation [15.567996062093089]
We propose a video transformer architecture that learns from multi-modal features and text descriptions of actions and objects.
We show that our model outperforms previous methods on the EpicKitchens datasets.
arXiv Detail & Related papers (2024-01-23T18:58:35Z)
- Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction [15.696593695918844]
This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels).
The experimental results showcase significant improvements in prediction performance across 3D-ResNet, Transformer, and LSTM architectures.
These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
arXiv Detail & Related papers (2023-08-08T21:18:23Z)
- Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Rethinking Learning Approaches for Long-Term Action Anticipation [32.67768331823358]
Action anticipation involves predicting future actions after observing the initial portion of a video.
We introduce ANTICIPATR, which performs long-term action anticipation.
We propose a two-stage learning approach to train a novel transformer-based model.
arXiv Detail & Related papers (2022-10-20T20:07:30Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.