On the Efficacy of Text-Based Input Modalities for Action Anticipation
- URL: http://arxiv.org/abs/2401.12972v3
- Date: Thu, 29 Aug 2024 15:11:29 GMT
- Title: On the Efficacy of Text-Based Input Modalities for Action Anticipation
- Authors: Apoorva Beedu, Harish Haresamudram, Karan Samel, Irfan Essa,
- Abstract summary: We propose a video transformer architecture that learns from multi-modal features and text descriptions of actions and objects.
We show that our model outperforms previous methods on the EpicKitchens datasets.
- Score: 15.567996062093089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities help narrow down plausible action choices. Each modality can provide diverse and often complementary context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text descriptions of actions and objects can also lead to more accurate action anticipation by providing additional contextual cues, e.g., about the environment and its contents. We propose a Multi-modal Contrastive Anticipative Transformer (M-CAT), a video transformer architecture that jointly learns from multi-modal features and text descriptions of actions and objects. We train our model in two stages, where the model first learns to align video clips with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Compared to existing methods, M-CAT has the advantage of learning additional context from two types of text inputs: rich descriptions of future actions during pre-training, and, text descriptions for detected objects and actions during modality feature fusion. Through extensive experimental evaluation, we demonstrate that our model outperforms previous methods on the EpicKitchens datasets, and show that using simple text descriptions of actions and objects aid in more effective action anticipation. In addition, we examine the impact of object and action information obtained via text, and perform extensive ablations.
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs [15.402143137362112]
Future interactive interfaces should intelligently provide quick access to digital actions based on users' context.
We generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs.
We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information.
arXiv Detail & Related papers (2024-05-06T23:11:00Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - Leveraging Next-Active Objects for Context-Aware Anticipation in
Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA)
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - Summarize the Past to Predict the Future: Natural Language Descriptions
of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z) - Distilling Knowledge from Language Models for Video-based Action
Anticipation [31.59130630384036]
Anticipating future actions in a video is useful for many autonomous and assistive technologies.
We propose a method to make use of the text-modality that is available during the training, to bring in complementary information that is not present in the target action anticipation datasets.
arXiv Detail & Related papers (2022-10-12T08:02:11Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in
Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Instance-Aware Predictive Navigation in Multi-Agent Environments [93.15055834395304]
We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures.
We adopt a novel multi-instance event prediction module to estimate the possible interaction among agents in the ego-centric view.
We design a sequential action sampling strategy to better leverage predicted states on both scene-level and instance-level.
arXiv Detail & Related papers (2021-01-14T22:21:25Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.