Related papers: On the Efficacy of Text-Based Input Modalities for Action Anticipation

URL: http://arxiv.org/abs/2401.12972v3
Date: Thu, 29 Aug 2024 15:11:29 GMT
Title: On the Efficacy of Text-Based Input Modalities for Action Anticipation
Authors: Apoorva Beedu, Harish Haresamudram, Karan Samel, Irfan Essa
Abstract summary: We propose a video transformer architecture that learns from multi-modal features and text descriptions of actions and objects. We show that our model outperforms previous methods on the EpicKitchens datasets.
Score: 15.567996062093089
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities helps narrow down plausible action choices. Each modality can provide diverse and often complementary context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text descriptions of actions and objects can also lead to more accurate action anticipation by providing additional contextual cues, e.g., about the environment and its contents. We propose a Multi-modal Contrastive Anticipative Transformer (M-CAT), a video transformer architecture that jointly learns from multi-modal features and text descriptions of actions and objects. We train our model in two stages, where the model first learns to align video clips with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Compared to existing methods, M-CAT has the advantage of learning additional context from two types of text inputs: rich descriptions of future actions during pre-training, and text descriptions for detected objects and actions during modality feature fusion. Through extensive experimental evaluation, we demonstrate that our model outperforms previous methods on the EpicKitchens datasets, and show that using simple text descriptions of actions and objects aids in more effective action anticipation. In addition, we examine the impact of object and action information obtained via text, and perform extensive ablations.

Related papers

Vision and Intention Boost Large Language Model in Long-Term Action Anticipation [39.66216219048517]
Long-term action anticipation aims to predict future actions over an extended period. Recent research leverages large language models (LLMs) through text-based inputs, which suffer severe information loss. We propose a novel Intention-Conditioned Vision-Language (ICVL) model in this study that fully leverages the rich semantic information of visual data.
arXiv Detail & Related papers (2025-05-03T06:33:54Z)
Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs [15.402143137362112]
Future interactive interfaces should intelligently provide quick access to digital actions based on users' context. We generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information.
arXiv Detail & Related papers (2024-05-06T23:11:00Z)
PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA). We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions. Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions. We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
Distilling Knowledge from Language Models for Video-based Action Anticipation [31.59130630384036]
Anticipating future actions in a video is useful for many autonomous and assistive technologies. We propose a method to make use of the text-modality that is available during the training, to bring in complementary information that is not present in the target action anticipation datasets.
arXiv Detail & Related papers (2022-10-12T08:02:11Z)
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content. Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects. We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence. We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
Instance-Aware Predictive Navigation in Multi-Agent Environments [93.15055834395304]
We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures. We adopt a novel multi-instance event prediction module to estimate the possible interaction among agents in the ego-centric view. We design a sequential action sampling strategy to better leverage predicted states on both scene-level and instance-level.
arXiv Detail & Related papers (2021-01-14T22:21:25Z)
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos. Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.