Learning to Align Sequential Actions in the Wild
- URL: http://arxiv.org/abs/2111.09301v1
- Date: Wed, 17 Nov 2021 18:55:36 GMT
- Title: Learning to Align Sequential Actions in the Wild
- Authors: Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua,
Marc Pollefeys
- Abstract summary: We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art methods for self-supervised sequential action alignment rely
on deep networks that find correspondences across videos in time. They either
learn frame-to-frame mapping across sequences, which does not leverage temporal
information, or assume monotonic alignment between each video pair, which
ignores variations in the order of actions. As such, these methods are not able
to deal with common real-world scenarios that involve background frames or
videos that contain non-monotonic sequences of actions.
In this paper, we propose an approach to align sequential actions in the wild
that involve diverse temporal variations. To this end, we enforce temporal
priors on the optimal transport matrix, which leverages
temporal consistency, while allowing for variations in the order of actions.
Our model accounts for both monotonic and non-monotonic sequences and handles
background frames that should not be aligned. We demonstrate that our approach
consistently outperforms the state-of-the-art in self-supervised sequential
action representation learning on four different benchmark datasets.
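The abstract does not give the exact formulation of its temporal prior, so the following is only a minimal illustrative sketch: entropic optimal transport (Sinkhorn iterations) where a hypothetical Gaussian penalty biases the transport cost toward the normalized diagonal, favoring temporally consistent matches without forcing strict monotonicity. The function name and parameters (`sigma`, `reg`) are assumptions, not the paper's API.

```python
import numpy as np

def sinkhorn_with_temporal_prior(cost, sigma=1.0, reg=0.1, n_iters=100):
    """Entropic OT between two frame sequences with a diagonal temporal prior.

    cost : (n, m) pairwise frame-distance matrix.
    sigma: width of the Gaussian temporal prior (larger = weaker prior).
    reg  : entropic regularization strength.
    Returns the (n, m) transport matrix T with uniform marginals.
    """
    n, m = cost.shape
    # Temporal prior: penalize matches far from the normalized diagonal,
    # softly encouraging alignment consistency rather than strict monotonicity.
    i = np.arange(n)[:, None] / max(n - 1, 1)
    j = np.arange(m)[None, :] / max(m - 1, 1)
    prior = (i - j) ** 2 / (2.0 * sigma ** 2)

    # Gibbs kernel over the prior-augmented cost.
    K = np.exp(-(cost + prior) / reg)

    # Uniform marginals over frames of each video.
    r = np.ones(n) / n
    c = np.ones(m) / m
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

In a full pipeline, `cost` would come from distances between learned frame embeddings of the two videos; the resulting soft assignment `T` can then drive a self-supervised alignment loss.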
Related papers
- Contrastive Sequential-Diffusion Learning: An approach to Multi-Scene Instructional Video Synthesis [9.687215124767063]
Action-centric sequence descriptions include non-linear patterns in which the next step may need to be visually consistent not with the immediately preceding step but with earlier steps.
We propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene.
Our experiments with real-world data demonstrate the practicality and improved consistency of our model compared to prior work.
arXiv Detail & Related papers (2024-07-16T15:03:05Z) - Made to Order: Discovering monotonic temporal changes via self-supervised video ordering [89.0660110757949]
We exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal.
We introduce a transformer-based model for ordering of image sequences of arbitrary length with built-in attribution maps.
arXiv Detail & Related papers (2024-04-25T17:59:56Z) - Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching [17.344430840048094]
Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies.
We develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation.
Unlike the existing methods, we model this task as local matching and global aggregation.
arXiv Detail & Related papers (2024-03-16T01:38:28Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
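The diagonalization step of DDTW is not described in this summary, so the sketch below shows only classic dynamic time warping on 1-D feature sequences, the baseline that DDTW modifies. The function name and interface are assumptions for illustration.

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic time warping between two 1-D sequences.

    Returns the minimum cumulative alignment cost, allowing
    frames to stretch (repeat) so sequences of different speeds align.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # local frame distance
            # Extend the cheapest of: insertion, deletion, or match.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In the video-alignment setting, the scalar distance would be replaced by a distance between per-frame feature vectors (e.g. pose or VGG embeddings, as the entry above describes).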
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
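The LCS contrasting above is formulated under a differentiable dynamic program in the paper; for reference, the standard (discrete, non-differentiable) longest-common-subsequence recurrence it builds on can be sketched as:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]
```

A differentiable variant would replace the hard `max` with a smooth approximation so the score can be used as a contrastive training objective.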
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
This challenging task is posed in an open-set setting, without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z) - SCT: Set Constrained Temporal Transformer for Set Supervised Action
Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets, on which it achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.