PALM: Predicting Actions through Language Models
- URL: http://arxiv.org/abs/2311.17944v2
- Date: Thu, 18 Jul 2024 10:31:53 GMT
- Title: PALM: Predicting Actions through Language Models
- Authors: Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang
- Abstract summary: We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
- Score: 74.10147822693791
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional methods heavily rely on representation learning that is trained on a large amount of video data. However, a major challenge arises from the difficulty of obtaining effective video representation. This difficulty stems from the complex and variable nature of human activities, which contrasts with the limited availability of data. In this study, we introduce PALM, an approach that tackles the task of long-term action anticipation, which aims to forecast forthcoming sequences of actions over an extended period. Our method PALM incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement maximal marginal relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate PALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies.
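The abstract's example-selection step can be illustrated with a short sketch. Below is a minimal, illustrative implementation of maximal marginal relevance (MMR) for choosing in-context examples, assuming candidate and query embeddings have already been computed and L2-normalised; the function name, the trade-off weight `lam`, and the number of examples `k` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mmr_select(query_vec, candidate_vecs, k=8, lam=0.5):
    """Pick k in-context examples via maximal marginal relevance (MMR).

    query_vec:      (d,) embedding of the current video context (assumed).
    candidate_vecs: (N, d) embeddings of candidate examples (assumed).
    Both are assumed L2-normalised, so dot products are cosine similarities.
    """
    relevance = candidate_vecs @ query_vec  # sim(example, query), shape (N,)
    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        if selected:
            # Highest similarity of each remaining candidate to anything already chosen.
            redundancy = (candidate_vecs[remaining] @ candidate_vecs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        # MMR score: reward relevance to the query, penalise redundancy with the chosen set.
        scores = lam * relevance[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under these assumptions, the selected indices would be used to pull past (context, future-action) pairs into the LLM prompt as in-context demonstrations, with `lam` trading off relevance to the current query against diversity among the chosen examples.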
Related papers
- From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation [30.161471749050833]
We propose a novel end-to-end video modeling architecture that utilizes attention mechanisms, named Anticipation via Recognition and Reasoning (ARR).
ARR decomposes the action anticipation task into action recognition and reasoning tasks, and effectively learns the statistical relationship between actions by next action prediction (NAP).
In addition, to address the challenge of relationship modeling that requires extensive training data, we propose an innovative approach for the unsupervised pre-training of the decoder.
arXiv Detail & Related papers (2024-08-05T18:38:29Z) - Temporal Grounding of Activities using Multimodal Large Language Models [0.0]
We evaluate the effectiveness of combining image-based and text-based large language models (LLMs) in a two-stage approach for temporal activity localization.
We demonstrate that our method outperforms existing video-based LLMs.
arXiv Detail & Related papers (2024-05-30T09:11:02Z) - Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z) - The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z) - Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction [15.696593695918844]
This paper introduces a novel self-supervised video strategy for enhancing action prediction inspired by DINO (self-distillation with no labels)
The experimental results showcase significant improvements in prediction performance across 3D-ResNet, Transformer, and LSTM architectures.
These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
arXiv Detail & Related papers (2023-08-08T21:18:23Z) - Look, Remember and Reason: Grounded reasoning in videos with language
models [5.3445140425713245]
Multi-modal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z) - Learning the Effects of Physical Actions in a Multi-modal Environment [17.757831697284498]
Large Language Models (LLMs) handle physical commonsense information inadequately.
We introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs.
We show that multi-modal models can capture physical commonsense when augmented with visual information.
arXiv Detail & Related papers (2023-01-27T16:49:52Z) - Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize an action from a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.