Related papers: REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport

URL: http://arxiv.org/abs/2509.24382v1
Date: Mon, 29 Sep 2025 07:32:14 GMT
Title: REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport
Authors: Soumyadeep Chandra, Kaushik Roy
Abstract summary: Real-world instructional data often contains background segments, repeated actions, and steps presented out of order. We introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme.
Score: 7.952582509792969
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL, leverage Kantorovich Optimal Transport (KOT) to build frame-to-frame correspondences, but rely solely on feature similarity and fail to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.
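
The abstract contrasts a feature-only Kantorovich cost with a fused Gromov-Wasserstein cost under partial mass. The following is a minimal sketch of that general idea, not the authors' R-FPGWOT implementation: it evaluates the standard fused Gromov-Wasserstein objective with squared loss for a given transport plan and checks the partial-transport constraints. All function names, the `alpha` trade-off, and the choice of normalized frame-index distance as temporal structure are assumptions made for illustration.

```python
import numpy as np


def temporal_structure(n):
    """Intra-video structure matrix (assumed): normalized |i - k| between frame indices."""
    idx = np.arange(n, dtype=float)
    return np.abs(idx[:, None] - idx[None, :]) / max(n - 1, 1)


def fused_gw_cost(T, M, C1, C2, alpha=0.5):
    """Fused Gromov-Wasserstein objective (squared loss) for a given plan T.

    T  : (n, m) transport plan between frames of video X and video Y
    M  : (n, m) cross-video feature cost between frame embeddings
    C1 : (n, n) structure matrix of X, C2 : (m, m) structure matrix of Y
    alpha : hypothetical weight of the structure term against the feature term
    """
    # Linear (Kantorovich) term: sum_ij M_ij T_ij
    feature_term = np.sum(M * T)
    # Quadratic term: sum_ijkl (C1_ik - C2_jl)^2 T_ij T_kl
    diff = C1[:, None, :, None] - C2[None, :, None, :]  # axes: i, j, k, l
    structure_term = np.einsum("ijkl,ij,kl->", diff ** 2, T, T)
    return (1 - alpha) * feature_term + alpha * structure_term


def is_partial_plan(T, p, q, mass, tol=1e-9):
    """Partial transport constraints: T >= 0, marginals bounded by p and q, fixed total mass."""
    return bool(
        np.all(T >= -tol)
        and np.all(T.sum(axis=1) <= p + tol)
        and np.all(T.sum(axis=0) <= q + tol)
        and abs(T.sum() - mass) <= tol
    )
```

Under these assumptions, a plan that transports less than the full mass can leave some frames (for example background segments) unmatched, while the quadratic term penalizes matchings that distort temporal distances within each video. In practice the plan would be optimized by a solver rather than supplied by hand.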

Related papers

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z)
PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning [30.44007644340425]
We introduce PROPA, a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art.
arXiv Detail & Related papers (2025-11-13T13:06:12Z)
Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport [5.80788851503526]
We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. We propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation.
arXiv Detail & Related papers (2025-07-21T12:09:12Z)
OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition [2.3093110834423616]
This paper presents OptiCorNet, a novel sequence modeling framework. It unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
arXiv Detail & Related papers (2025-07-19T04:29:43Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability of VLMs in terms of zero-shot generalization; the combined method is dubbed OrthSR. For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust-temporal segmentation correspondences in videos. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition [38.27785891922479]
Few-shot learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2022-10-30T11:46:38Z)
Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime. Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task. We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs. We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.