On the Efficacy of Text-Based Input Modalities for Action Anticipation
- URL: http://arxiv.org/abs/2401.12972v1
- Date: Tue, 23 Jan 2024 18:58:35 GMT
- Title: On the Efficacy of Text-Based Input Modalities for Action Anticipation
- Authors: Apoorva Beedu, Karan Samel, Irfan Essa
- Abstract summary: We propose a Multi-modal Anticipative Transformer (MAT) that jointly learns from multi-modal features and text captions.
We train our model in two stages: the model first learns to predict actions in the video clip by aligning with captions, and in the second stage, we fine-tune the model to predict future actions.
- Score: 18.92991055344741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although the task of anticipating future actions is highly uncertain,
information from additional modalities helps to narrow down plausible action
choices. Each modality provides different environmental context for the model
to learn from. While previous multi-modal methods leverage information from
modalities such as video and audio, we primarily explore how text inputs for
actions and objects can also enable more accurate action anticipation.
Therefore, we propose a Multi-modal Anticipative Transformer (MAT), an
attention-based video transformer architecture that jointly learns from
multi-modal features and text captions. We train our model in two stages: the
model first learns to predict actions in the video clip by aligning with
captions, and in the second stage, we fine-tune the model to predict future
actions. Compared to existing methods, MAT has the advantage of learning
additional environmental context from two kinds of text inputs: action
descriptions during the pre-training stage, and text inputs for detected
objects and actions during modality feature fusion. Through extensive
experiments, we evaluate the effectiveness of the pre-training stage, and show
that our model outperforms previous methods on all datasets. In addition, we
examine the impact of object and action information obtained via text and
perform extensive ablations. We evaluate performance on three datasets:
EpicKitchens-100, EpicKitchens-55 and EGTEA GAZE+; and show that text
descriptions do indeed aid in more effective action anticipation.
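The abstract outlines a two-stage recipe: caption-aligned pre-training followed by fine-tuning for future-action prediction, with text features for detected objects and actions fused alongside video features. The sketch below is a minimal illustration of that recipe only; every module name, dimension, and loss choice (including the CLIP-style contrastive alignment objective) is an assumption, not the authors' released MAT implementation.

```python
# Minimal, self-contained sketch of the two-stage recipe described above.
# All names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalAnticipator(nn.Module):
    """Toy stand-in: video clip features are fused with text embeddings of
    detected objects/actions by a small transformer, then classified."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 97):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, feat_dim)
        self.text_proj = nn.Linear(feat_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) clip tokens; text_feats: (B, K, D) embeddings
        # of caption / detected-object / action phrases.
        tokens = torch.cat([self.video_proj(video_feats),
                            self.text_proj(text_feats)], dim=1)
        pooled = self.fusion(tokens).mean(dim=1)        # (B, D)
        return pooled, self.classifier(pooled)          # (B, D), (B, A)


def stage1_caption_alignment(pooled, caption_emb, temperature=0.07):
    """Stage 1 (assumed form): contrastively align clip representations
    with embeddings of their captions."""
    v = F.normalize(pooled, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    logits = v @ c.t() / temperature
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def stage2_anticipation(action_logits, future_labels):
    """Stage 2: fine-tune to predict the label of the upcoming action."""
    return F.cross_entropy(action_logits, future_labels)


if __name__ == "__main__":
    model = MultiModalAnticipator()
    video = torch.randn(4, 8, 512)    # 4 clips, 8 temporal tokens each
    text = torch.randn(4, 5, 512)     # 5 object/action text tokens per clip
    captions = torch.randn(4, 512)    # caption embeddings (stage 1 targets)
    labels = torch.randint(0, 97, (4,))

    pooled, logits = model(video, text)
    # The two stages run sequentially in the paper; both losses are exercised
    # together here only to check shapes and gradients.
    loss = stage1_caption_alignment(pooled, captions) + stage2_anticipation(logits, labels)
    loss.backward()
    print(float(loss))
```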
Related papers
- Vision and Intention Boost Large Language Model in Long-Term Action Anticipation [39.66216219048517]
Long-term action anticipation aims to predict future actions over an extended period.
Recent research leverages large language models (LLMs) via text-based inputs, which suffer from severe information loss.
In this study, we propose a novel Intention-Conditioned Vision-Language (ICVL) model that fully leverages the rich semantic information of visual data.
arXiv Detail & Related papers (2025-05-03T06:33:54Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs [15.402143137362112]
Future interactive interfaces should intelligently provide quick access to digital actions based on users' context.
We generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs.
We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information.
arXiv Detail & Related papers (2024-05-06T23:11:00Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object Interaction Anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Distilling Knowledge from Language Models for Video-based Action Anticipation [31.59130630384036]
Anticipating future actions in a video is useful for many autonomous and assistive technologies.
We propose a method that makes use of the text modality available during training to bring in complementary information that is not present in the target action anticipation datasets (a generic distillation sketch follows this list).
arXiv Detail & Related papers (2022-10-12T08:02:11Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Instance-Aware Predictive Navigation in Multi-Agent Environments [93.15055834395304]
We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures.
We adopt a novel multi-instance event prediction module to estimate the possible interaction among agents in the ego-centric view.
We design a sequential action sampling strategy to better leverage predicted states on both scene-level and instance-level.
arXiv Detail & Related papers (2021-01-14T22:21:25Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
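The "Distilling Knowledge from Language Models for Video-based Action Anticipation" entry above relies on text that is available only at training time; a common way to realize that idea is teacher-student distillation. The sketch below is a generic illustration under that assumption, not that paper's actual pipeline, and every name and hyperparameter is made up.

```python
# Generic teacher-student distillation sketch (an assumed realization of
# "use the text modality at training time"), not the cited paper's method.
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, alpha=0.5):
    """Blend cross-entropy on ground-truth future actions with a KL term that
    pulls the video student toward the text-derived teacher's softened
    distribution; the teacher is dropped at test time."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean")
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1.0 - alpha) * (temperature ** 2) * kd


if __name__ == "__main__":
    student = torch.randn(4, 97, requires_grad=True)  # video-model logits
    teacher = torch.randn(4, 97)                      # text/LM-derived logits
    labels = torch.randint(0, 97, (4,))
    distill_loss(student, teacher, labels).backward()
```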
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.