A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
- URL: http://arxiv.org/abs/2511.17805v1
- Date: Fri, 21 Nov 2025 21:59:22 GMT
- Title: A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
- Authors: Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
- Abstract summary: Procedural activities are highly structured as a set of actions conducted in a specific temporal order. Current self-supervised learning methods often overlook the procedural nature that underpins such activities. We propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal.
- Score: 11.039713164587456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
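To make the primary objective concrete, the sketch below shows a minimal Plackett-Luce ranking loss in PyTorch: per-frame scores arranged in chronological order are trained so that the true ordering receives maximal likelihood. This is an illustrative sketch inferred from the abstract, not the authors' released code; the scoring head, frame count, feature dimension, and the convention that earlier frames receive higher scores are assumptions.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Plackett-Luce ranking model.

    `scores` has shape (batch, n_frames) and is assumed to be arranged in
    the ground-truth chronological order, so the target ranking is the
    identity permutation. Under the PL model the probability of that
    ranking is prod_i exp(s_i) / sum_{j >= i} exp(s_j).
    """
    # logcumsumexp over the reversed sequence gives log sum_{j >= i} exp(s_j).
    suffix_logsumexp = torch.logcumsumexp(scores.flip(dims=[-1]), dim=-1).flip(dims=[-1])
    # -log P(identity ranking) = sum_i (suffix_logsumexp_i - s_i)
    return (suffix_logsumexp - scores).sum(dim=-1).mean()

# Toy usage with hypothetical shapes (the real encoder and scoring head differ):
# per-frame features from a video encoder, sampled in chronological order.
feats = torch.randn(4, 8, 256)              # (batch, n_frames, feat_dim)
score_head = torch.nn.Linear(256, 1)
scores = score_head(feats).squeeze(-1)      # (batch, n_frames)
loss = plackett_luce_nll(scores)
loss.backward()
```

The secondary spatio-temporal jigsaw loss mentioned in the abstract would add a separate permutation-prediction objective over shuffled spatio-temporal patches; it is omitted here for brevity.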
Related papers
- Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models [67.75857052135154]
Deep knowledge tracing models have achieved significant breakthroughs in modeling student learning trajectories. Unlike traditional methods that require offline training on a fixed training set, our approach performs real-time "live" knowledge tracing in an online way. We demonstrate, using several datasets of increasing size, that our method achieves competitive predictive performance with up to 273x speedups.
arXiv Detail & Related papers (2026-02-06T09:49:28Z)
- AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection [11.750791465488438]
This paper studies the problem of class-incremental learning (CIL). Traditional CIL methods, which do not leverage pre-trained models (PTMs), suffer from catastrophic forgetting (CF). We propose AnaCP, a novel method that preserves the efficiency of analytic classifiers while enabling incremental feature adaptation without gradient-based training.
arXiv Detail & Related papers (2025-11-17T19:56:15Z)
- DEAS: DEtached value learning with Action Sequence for Scalable Offline RL [46.40818333031899]
DEtached value learning with Action Sequence (DEAS) is a simple yet effective offline RL framework that leverages action sequences for value learning. DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench. It can be applied to enhance the performance of large-scale Vision-Language-Action models.
arXiv Detail & Related papers (2025-10-09T03:11:09Z)
- REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport [7.952582509792969]
Real-world instructional data often contains background segments, repeated actions, and steps presented out of order. We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
arXiv Detail & Related papers (2025-09-29T07:32:14Z)
- Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation [48.08416841005715]
We introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation.
It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator.
Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency.
arXiv Detail & Related papers (2024-06-14T14:49:12Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning [45.13060970066485]
We propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS).
Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-07-20T04:21:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.