On the Efficacy of Text-Based Input Modalities for Action Anticipation
- URL: http://arxiv.org/abs/2401.12972v1
- Date: Tue, 23 Jan 2024 18:58:35 GMT
- Title: On the Efficacy of Text-Based Input Modalities for Action Anticipation
- Authors: Apoorva Beedu, Karan Samel, Irfan Essa
- Abstract summary: We propose a Multi-modal Anticipative Transformer (MAT) that jointly learns from multi-modal features and text captions.
We train our model in two stages: the model first learns to predict actions in the video clip by aligning with captions, and in the second stage, we fine-tune the model to predict future actions.
- Score: 18.92991055344741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although the task of anticipating future actions is highly uncertain,
information from additional modalities helps to narrow down plausible action
choices. Each modality provides different environmental context for the model
to learn from. While previous multi-modal methods leverage information from
modalities such as video and audio, we primarily explore how text inputs for
actions and objects can also enable more accurate action anticipation.
Therefore, we propose a Multi-modal Anticipative Transformer (MAT), an
attention-based video transformer architecture that jointly learns from
multi-modal features and text captions. We train our model in two stages, where
the model first learns to predict actions in the video clip by aligning with
captions, and during the second stage, we fine-tune the model to predict future
actions. Compared to existing methods, MAT has the advantage of learning
additional environmental context from two kinds of text inputs: action
descriptions during the pre-training stage, and the text inputs for detected
objects and actions during modality feature fusion. Through extensive
experiments, we evaluate the effectiveness of the pre-training stage, and show
that our model outperforms previous methods on all datasets. In addition, we
examine the impact of object and action information obtained via text and
perform extensive ablations. We evaluate the performance on three datasets:
EpicKitchens-100, EpicKitchens-55, and EGTEA Gaze+, and show that text
descriptions do indeed aid in more effective action anticipation.
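To make the two-stage recipe concrete, here is a minimal sketch of how the multi-modal fusion with text inputs and the two training objectives could be wired up. This is an illustration under assumptions, not the authors' released code: the class name `MATSketch`, the dimensions, and the choice of a CLIP-style contrastive loss for the caption-alignment stage are all hypothetical.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MATSketch(nn.Module):
    """Illustrative MAT-style fusion: video tokens and text tokens for
    detected objects/actions are jointly attended by a transformer."""
    def __init__(self, dim=512, num_actions=1000, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Parameter(torch.randn(1, 1, dim))
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T, dim) clip features from a video backbone
        # text_tokens:  (B, K, dim) embeddings of detected objects/actions
        b = video_tokens.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1), video_tokens, text_tokens], dim=1)
        fused = self.fusion(tokens)
        clip_repr = fused[:, 0]  # pooled clip representation
        return clip_repr, self.action_head(clip_repr)

def stage1_alignment_loss(clip_repr, caption_emb, temperature=0.07):
    # Stage 1: align clip representations with caption embeddings.
    # A CLIP-style symmetric contrastive loss is one plausible choice;
    # the paper's exact objective may differ.
    v = F.normalize(clip_repr, dim=-1)
    t = F.normalize(caption_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def stage2_anticipation_loss(action_logits, future_action_labels):
    # Stage 2: fine-tune the fused model to predict the future action label.
    return F.cross_entropy(action_logits, future_action_labels)
```
Read as a sketch: in the first stage, `clip_repr` would be trained against caption embeddings produced by a text encoder; in the second stage, the same backbone is fine-tuned with `stage2_anticipation_loss` against action labels occurring after the observed clip.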
Related papers
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Rethinking Learning Approaches for Long-Term Action Anticipation [32.67768331823358]
Action anticipation involves predicting future actions having observed the initial portion of a video.
We introduce ANTICIPATR, which performs long-term action anticipation.
We propose a two-stage learning approach to train a novel transformer-based model.
arXiv Detail & Related papers (2022-10-20T20:07:30Z)
- Distilling Knowledge from Language Models for Video-based Action Anticipation [31.59130630384036]
Anticipating future actions in a video is useful for many autonomous and assistive technologies.
We propose a method that makes use of the text modality available during training to bring in complementary information not present in the target action anticipation datasets.
arXiv Detail & Related papers (2022-10-12T08:02:11Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and the sample efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video [27.391434284586985]
Rolling-Unrolling LSTM is a learning architecture to anticipate actions from egocentric videos.
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
arXiv Detail & Related papers (2020-05-04T14:13:41Z)