Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models
- URL: http://arxiv.org/abs/2508.00374v1
- Date: Fri, 01 Aug 2025 07:07:24 GMT
- Title: Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models
- Authors: Yuji Sato, Yasunori Ishii, Takayoshi Yamashita
- Abstract summary: Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model.
- Score: 6.88204255655161
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.
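The abstract reports results in terms of edit distance. As a rough illustration only, the sketch below computes a plain Levenshtein distance between a predicted and a ground-truth action sequence, normalized by the ground-truth length. The function name, the normalization, and the example action labels are assumptions made for illustration; the exact metric variant used in the paper's evaluation is not specified here and may differ.

```python
from typing import Sequence


def normalized_edit_distance(pred: Sequence[str], target: Sequence[str]) -> float:
    """Levenshtein distance between two action sequences, divided by len(target).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if not target:
        raise ValueError("target sequence must be non-empty")

    # prev[j] holds the distance between the pred prefix processed so far
    # and the first j items of target.
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]  # deleting all i predicted items to match an empty target
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(
                prev[j] + 1,         # delete p from the prediction
                curr[j - 1] + 1,     # insert t into the prediction
                prev[j - 1] + cost,  # substitute p with t (free if equal)
            ))
        prev = curr
    return prev[-1] / len(target)


# Made-up action labels: the last two actions are predicted in the wrong order.
predicted = ["take knife", "cut onion", "put knife", "wash hands"]
ground_truth = ["take knife", "cut onion", "wash hands", "put knife"]
score = normalized_edit_distance(predicted, ground_truth)
```

Lower values indicate a predicted sequence closer to the ground truth; a value of 0 means the two sequences match exactly.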
Related papers
- Enhancing Human Motion Prediction via Multi-range Decoupling Decoding with Gating-adjusting Aggregation [19.11704999742834]
Expressive representation of pose sequences is crucial for accurate motion modeling in human motion prediction.
Recent deep learning-based methods tend to overlook the varying relevance and dependencies between historical information and future moments.
We propose a novel approach called multi-range decoupling decoding with gating-adjusting aggregation.
(2025-03-30T10:10:31Z)
- Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction Method [3.303114252531234]
We present BLNet, a novel dual-stream architecture that integrates behavioral intention recognition and lane constraint modeling.
Our network exhibits significant performance gains over existing direct regression and goal-based algorithms.
(2025-03-27T13:06:57Z)
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments.
Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction.
We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
(2025-03-13T17:59:52Z)
- Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator.
We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation.
Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
(2024-08-30T15:39:34Z)
- From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation [30.161471749050833]
We propose a novel end-to-end video modeling architecture that utilizes attention mechanisms, named Anticipation via Recognition and Reasoning (ARR).
ARR decomposes the action anticipation task into action recognition and reasoning tasks, and effectively learns the statistical relationship between actions by next action prediction (NAP).
In addition, to address the challenge of relationship modeling that requires extensive training data, we propose an innovative approach for the unsupervised pre-training of the decoder.
(2024-08-05T18:38:29Z)
- GDTS: Goal-Guided Diffusion Model with Tree Sampling for Multi-Modal Pedestrian Trajectory Prediction [15.731398013255179]
We propose a novel Goal-Guided Diffusion Model with Tree Sampling for multi-modal trajectory prediction.
A two-stage tree sampling algorithm is presented, which leverages common features to reduce the inference time and improve accuracy for multi-modal prediction.
Experimental results demonstrate that our proposed framework achieves performance comparable to the state of the art at real-time inference speed on public datasets.
(2023-11-25T03:55:06Z)
- Motion-Scenario Decoupling for Rat-Aware Video Position Prediction: Strategy and Benchmark [49.58762201363483]
We introduce RatPose, a bio-robot motion prediction dataset constructed by considering the influence factors of individuals and environments.
We propose a Dual-stream Motion-Scenario Decoupling framework that effectively separates scenario-oriented and motion-oriented features.
We demonstrate significant performance improvements of the proposed DMSD framework on different difficulty-level tasks.
(2023-05-17T14:14:31Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
(2023-03-31T10:53:24Z)
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video.
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
(2022-04-28T08:21:09Z)
- Temporally-Continuous Probabilistic Prediction using Polynomial Trajectory Parameterization [12.896275507449936]
A commonly-used representation for motion prediction of actors is a sequence of waypoints for each actor at discrete future time-points.
This approach is simple and flexible, but it can exhibit unrealistic higher-order derivatives and approximation errors at intermediate time steps.
We propose a simple and general representation for temporally continuous trajectory prediction that is based on trajectory parameterization.
(2020-11-01T01:51:44Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
(2020-06-07T13:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.