Learning to Align Sequential Actions in the Wild
- URL: http://arxiv.org/abs/2111.09301v1
- Date: Wed, 17 Nov 2021 18:55:36 GMT
- Title: Learning to Align Sequential Actions in the Wild
- Authors: Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua,
Marc Pollefeys
- Abstract summary: We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: State-of-the-art methods for self-supervised sequential action alignment rely
on deep networks that find correspondences across videos in time. They either
learn frame-to-frame mapping across sequences, which does not leverage temporal
information, or assume monotonic alignment between each video pair, which
ignores variations in the order of actions. As such, these methods are not able
to deal with common real-world scenarios that involve background frames or
videos that contain non-monotonic sequences of actions.
In this paper, we propose an approach to align sequential actions in the wild
that involve diverse temporal variations. To this end, we enforce temporal
priors on the optimal transport matrix, which leverages
temporal consistency, while allowing for variations in the order of actions.
Our model accounts for both monotonic and non-monotonic sequences and handles
background frames that should not be aligned. We demonstrate that our approach
consistently outperforms the state-of-the-art in self-supervised sequential
action representation learning on four different benchmark datasets.
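The abstract does not give the exact formulation of its temporal prior, so the following is only a minimal illustrative sketch: entropic optimal transport (Sinkhorn iterations) where a hypothetical Gaussian penalty biases the transport cost toward the normalized diagonal, favoring temporally consistent matches without forcing strict monotonicity. The function name and parameters (`sigma`, `reg`) are assumptions, not the paper's API.

```python
import numpy as np

def sinkhorn_with_temporal_prior(cost, sigma=1.0, reg=0.1, n_iters=100):
    """Entropic OT between two frame sequences with a diagonal temporal prior.

    cost : (n, m) pairwise frame-distance matrix.
    sigma: width of the Gaussian temporal prior (larger = weaker prior).
    reg  : entropic regularization strength.
    Returns the (n, m) transport matrix T with uniform marginals.
    """
    n, m = cost.shape
    # Temporal prior: penalize matches far from the normalized diagonal,
    # softly encouraging alignment consistency rather than strict monotonicity.
    i = np.arange(n)[:, None] / max(n - 1, 1)
    j = np.arange(m)[None, :] / max(m - 1, 1)
    prior = (i - j) ** 2 / (2.0 * sigma ** 2)

    # Gibbs kernel over the prior-augmented cost.
    K = np.exp(-(cost + prior) / reg)

    # Uniform marginals over frames of each video.
    r = np.ones(n) / n
    c = np.ones(m) / m
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

In a full pipeline, `cost` would come from distances between learned frame embeddings of the two videos; the resulting soft assignment `T` can then drive a self-supervised alignment loss.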
Related papers
- Contrastive Sequential-Diffusion Learning: An approach to Multi-Scene Instructional Video Synthesis [9.687215124767063]
Action-centric sequence descriptions include non-linear patterns in which the next step may need to be visually consistent not with the immediately preceding step but with earlier steps.
We propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene.
Our experiments with real-world data demonstrate the practicality and improved consistency of our model compared to prior work.
arXiv Detail & Related papers (2024-07-16T15:03:05Z) - Made to Order: Discovering monotonic temporal changes via self-supervised video ordering [89.0660110757949]
We exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal.
We introduce a transformer-based model for ordering of image sequences of arbitrary length with built-in attribution maps.
arXiv Detail & Related papers (2024-04-25T17:59:56Z) - Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching [17.344430840048094]
Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies.
We develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation.
Unlike the existing methods, we model this task as local matching and global aggregation.
arXiv Detail & Related papers (2024-03-16T01:38:28Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
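The diagonalization step of DDTW is not described in this summary, so the sketch below shows only classic dynamic time warping on 1-D feature sequences, the baseline that DDTW modifies. The function name and interface are assumptions for illustration.

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic time warping between two 1-D sequences.

    Returns the minimum cumulative alignment cost, allowing
    frames to stretch (repeat) so sequences of different speeds align.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # local frame distance
            # Extend the cheapest of: insertion, deletion, or match.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In the video-alignment setting, the scalar distance would be replaced by a distance between per-frame feature vectors (e.g. pose or VGG embeddings, as the entry above describes).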
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
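The LCS contrasting above is formulated under a differentiable dynamic program in the paper; for reference, the standard (discrete, non-differentiable) longest-common-subsequence recurrence it builds on can be sketched as:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]
```

A differentiable variant would replace the hard `max` with a smooth approximation so the score can be used as a contrastive training objective.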
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
This challenging task is posed in an open-set setting, without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z) - SCT: Set Constrained Temporal Transformer for Set Supervised Action
Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets, on which it achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.