Related papers: Hierarchical Action Learning for Weakly-Supervised Action Segmentation

URL: http://arxiv.org/abs/2602.24275v1
Date: Fri, 27 Feb 2026 18:48:22 GMT
Title: Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Authors: Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao
Abstract summary: We propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. Experimental results show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation.
Score: 43.688046710022626
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.

Related papers

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation [50.027425808733994]
NaVIDA is a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn causal relationship between visual changes and corresponding actions. Experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters.
arXiv Detail & Related papers (2026-01-26T06:16:17Z)
Learning Action Hierarchies via Hybrid Geometric Diffusion [10.176137688183575]
Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. We propose HybridTAS, a framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models. Our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
arXiv Detail & Related papers (2026-01-05T08:59:07Z)
Structured Agent Distillation for Large Language Model [56.38279355868093]
We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models. Our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines.
arXiv Detail & Related papers (2025-05-20T02:01:55Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning [13.411096520754507]
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors. We propose a dynamic action semantic-aware graph transformer to comprehensively capture the essence of object behavior.
arXiv Detail & Related papers (2025-02-19T14:16:47Z)
Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models. SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action localization. POTLoc is designed to identify and track continuous action structures via a self-training strategy. It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z)
Learning Efficient Abstract Planning Models that Choose What to Predict [28.013014215441505]
We show that existing symbolic operator learning approaches fall short in many robotics domains. This is primarily because they attempt to learn operators that exactly predict all observed changes in the abstract state. We propose to learn operators that 'choose what to predict' by only modelling changes necessary for abstract planning to achieve specified goals.
arXiv Detail & Related papers (2022-08-16T13:12:59Z)
Semi-Supervised Few-Shot Atomic Action Recognition [59.587738451616495]
We propose a novel model for semi-supervised few-shot atomic action recognition. Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation. Experiments show that our model can attain high accuracy on representative atomic action datasets outperforming their respective state-of-the-art classification accuracy in full supervision setting.
arXiv Detail & Related papers (2020-11-17T03:59:05Z)
Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition [16.22360992454675]
Action recognition via 3D skeleton data has become an important topic in recent years. In this paper, we propose, for the first time, a contrastive action learning paradigm named AS-CAL. Our approach typically improves on existing hand-crafted methods by 10-50% in top-1 accuracy.
arXiv Detail & Related papers (2020-08-01T06:37:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.