Joint Self-Supervised Video Alignment and Action Segmentation
- URL: http://arxiv.org/abs/2503.16832v1
- Date: Fri, 21 Mar 2025 04:02:00 GMT
- Title: Joint Self-Supervised Video Alignment and Action Segmentation
- Authors: Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran
- Abstract summary: We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. We first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior. We then extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation.
- Score: 6.734637459963131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
Related papers
- CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion [27.567059323636112]
We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning.
arXiv Detail & Related papers (2025-12-17T23:16:02Z)
- SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment [76.60024640625478]
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps. We propose a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
arXiv Detail & Related papers (2025-08-08T07:26:34Z)
- Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport [5.80788851503526]
We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. We propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation.
arXiv Detail & Related papers (2025-07-21T12:09:12Z)
- Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation [31.622109513774635]
We propose a novel approach to the action segmentation task for long, untrimmed videos.
By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation.
Our method does not require knowing the action order for a video to attain temporal consistency.
arXiv Detail & Related papers (2024-04-01T22:53:47Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation [10.82074185158027]
We introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation.
The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding.
MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots.
arXiv Detail & Related papers (2023-08-22T04:23:59Z)
- Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluated our method on three recent multimodal datasets and demonstrated the effectiveness of our method in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS).
Unlike previous work, our formulation allows us to learn dense feature representations directly in a fully convolutional regime.
Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events. Second, to accurately aggregate these cues into a global video representation, we mine interactions among only a few selected foreground objects via a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- Motion-supervised Co-Part Segmentation [88.40393225577088]
We propose a self-supervised deep learning method for co-part segmentation.
Our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.
arXiv Detail & Related papers (2020-04-07T09:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.