Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
- URL: http://arxiv.org/abs/2507.15540v1
- Date: Mon, 21 Jul 2025 12:09:12 GMT
- Title: Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
- Authors: Syed Ahmed Mahmood, Ali Shah Ali, Umer Ahmed, Fawad Javed Fateh, M. Zeeshan Zia, Quoc-Huy Tran,
- Abstract summary: We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from unlabeled procedural videos.<n>Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order.<n>We propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation.
- Score: 5.80788851503526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.
Related papers
- REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport [7.952582509792969]
Real-world instructional data often contains background segments, repeated actions, and steps presented out of order.<n>We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT)<n>In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
arXiv Detail & Related papers (2025-09-29T07:32:14Z) - Joint Self-Supervised Video Alignment and Action Segmentation [6.734637459963131]
We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework.<n>We first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior.<n>We extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation.
arXiv Detail & Related papers (2025-03-21T04:02:00Z) - Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation [31.622109513774635]
We propose a novel approach to the action segmentation task for long, untrimmed videos.
By encoding a temporal consistency prior to a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation.
Our method does not require knowing the action order for a video to attain temporal consistency.
arXiv Detail & Related papers (2024-04-01T22:53:47Z) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS)
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - On Constraints in First-Order Optimization: A View from Non-Smooth
Dynamical Systems [99.59934203759754]
We introduce a class of first-order methods for smooth constrained optimization.
Two distinctive features of our approach are that projections or optimizations over the entire feasible set are avoided.
The resulting algorithmic procedure is simple to implement even when constraints are nonlinear.
arXiv Detail & Related papers (2021-07-17T11:45:13Z) - Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z) - Deep Shells: Unsupervised Shape Correspondence with Optimal Transport [52.646396621449]
We propose a novel unsupervised learning approach to 3D shape correspondence.
We show that the proposed method significantly improves over the state-of-the-art on multiple datasets.
arXiv Detail & Related papers (2020-10-28T22:24:07Z) - Panoster: End-to-end Panoptic Segmentation of LiDAR Point Clouds [81.12016263972298]
We present Panoster, a novel proposal-free panoptic segmentation method for LiDAR point clouds.
Unlike previous approaches, Panoster proposes a simplified framework incorporating a learning-based clustering solution to identify instances.
At inference time, this acts as a class-agnostic segmentation, allowing Panoster to be fast, while outperforming prior methods in terms of accuracy.
arXiv Detail & Related papers (2020-10-28T18:10:20Z) - Memory Clustering using Persistent Homology for Multimodality- and
Discontinuity-Sensitive Learning of Optimal Control Warm-starts [24.576214898129823]
Shooting methods are an efficient approach to solving nonlinear optimal control problems.
Recent work has focused on providing an initial guess from a learned model trained on samples generated during an offline exploration of the problem space.
In this work, we apply tools from algebraic topology to extract information on the underlying structure of the solution space.
arXiv Detail & Related papers (2020-10-02T14:24:59Z) - Weakly supervised cross-domain alignment with optimal transport [102.8572398001639]
Cross-domain alignment between image objects and text sequences is key to many visual-language tasks.
This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities.
arXiv Detail & Related papers (2020-08-14T22:48:36Z) - Structure Adaptive Algorithms for Stochastic Bandits [22.871155520200773]
We study reward maximisation in a class of structured multi-armed bandit problems.
The mean rewards of arms satisfy some given structural constraints.
We develop algorithms from instance-dependent lower-bounds using iterative saddle-point solvers.
arXiv Detail & Related papers (2020-07-02T08:59:54Z) - FedSplit: An algorithmic framework for fast federated optimization [40.42352500741025]
We introduce FedSplit, a class of algorithms for solving distributed convex minimization with additive structure.
Our theory shows that these methods are provably robust to inexact computation of intermediate local quantities.
arXiv Detail & Related papers (2020-05-11T16:30:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.