A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
- URL: http://arxiv.org/abs/2511.17805v1
- Date: Fri, 21 Nov 2025 21:59:22 GMT
- Title: A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
- Authors: Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
- Abstract summary: Procedural activities are highly structured as a set of actions conducted in a specific temporal order. Current self-supervised learning methods often overlook the procedural nature that underpins such activities. We propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal.
- Score: 11.039713164587456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
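To make the primary objective concrete, the sketch below shows a minimal Plackett-Luce ranking loss in PyTorch: per-frame scores arranged in chronological order are trained so that the true ordering receives maximal likelihood. This is an illustrative sketch inferred from the abstract, not the authors' released code; the scoring head, frame count, feature dimension, and the convention that earlier frames receive higher scores are assumptions.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Plackett-Luce ranking model.

    `scores` has shape (batch, n_frames) and is assumed to be arranged in
    the ground-truth chronological order, so the target ranking is the
    identity permutation. Under the PL model the probability of that
    ranking is prod_i exp(s_i) / sum_{j >= i} exp(s_j).
    """
    # logcumsumexp over the reversed sequence gives log sum_{j >= i} exp(s_j).
    suffix_logsumexp = torch.logcumsumexp(scores.flip(dims=[-1]), dim=-1).flip(dims=[-1])
    # -log P(identity ranking) = sum_i (suffix_logsumexp_i - s_i)
    return (suffix_logsumexp - scores).sum(dim=-1).mean()

# Toy usage with hypothetical shapes (the real encoder and scoring head differ):
# per-frame features from a video encoder, sampled in chronological order.
feats = torch.randn(4, 8, 256)              # (batch, n_frames, feat_dim)
score_head = torch.nn.Linear(256, 1)
scores = score_head(feats).squeeze(-1)      # (batch, n_frames)
loss = plackett_luce_nll(scores)
loss.backward()
```

The secondary spatio-temporal jigsaw loss mentioned in the abstract would add a separate permutation-prediction objective over shuffled spatio-temporal patches; it is omitted here for brevity.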
Related papers
- Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models [67.75857052135154]
Deep knowledge tracing models have achieved significant breakthroughs in modeling student learning trajectories. Unlike traditional methods that require offline training on a fixed training set, our approach performs real-time "live" knowledge tracing in an online way. We demonstrate, using several datasets of increasing size, that our method achieves competitive predictive performance with up to 273x speedups.
arXiv Detail & Related papers (2026-02-06T09:49:28Z)
- AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection [11.750791465488438]
This paper studies the problem of class-incremental learning (CIL). Traditional CIL methods, which do not leverage pre-trained models (PTMs), suffer from catastrophic forgetting (CF). We propose AnaCP, a novel method that preserves the efficiency of analytic classifiers while enabling incremental feature adaptation without gradient-based training.
arXiv Detail & Related papers (2025-11-17T19:56:15Z)
- DEAS: DEtached value learning with Action Sequence for Scalable Offline RL [46.40818333031899]
DEtached value learning with Action Sequence (DEAS) is a simple yet effective offline RL framework that leverages action sequences for value learning. DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench. It can be applied to enhance the performance of large-scale Vision-Language-Action models.
arXiv Detail & Related papers (2025-10-09T03:11:09Z)
- REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport [7.952582509792969]
Real-world instructional data often contains background segments, repeated actions, and steps presented out of order. We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
arXiv Detail & Related papers (2025-09-29T07:32:14Z)
- Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation [48.08416841005715]
We introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation.
It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator.
Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency.
arXiv Detail & Related papers (2024-06-14T14:49:12Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning [45.13060970066485]
We propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS).
Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-07-20T04:21:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.