Representation Learning via Global Temporal Alignment and
Cycle-Consistency
- URL: http://arxiv.org/abs/2105.05217v1
- Date: Tue, 11 May 2021 17:34:04 GMT
- Title: Representation Learning via Global Temporal Alignment and
Cycle-Consistency
- Authors: Isma Hadji, Konstantinos G. Derpanis, Allan D. Jepson
- Abstract summary: We introduce a weakly supervised method for representation learning based on aligning temporal sequences.
We report significant performance increases over previous methods.
In addition, we report two applications of our temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
- Score: 20.715813546383178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a weakly supervised method for representation learning based on
aligning temporal sequences (e.g., videos) of the same process (e.g., human
action). The main idea is to use the global temporal ordering of latent
correspondences across sequence pairs as a supervisory signal. In particular,
we propose a loss based on scoring the optimal sequence alignment to train an
embedding network. Our loss is based on a novel probabilistic path finding view
of dynamic time warping (DTW) that contains the following three key features:
(i) the local path routing decisions are contrastive and differentiable, (ii)
pairwise distances are cast as probabilities that are contrastive as well, and
(iii) our formulation naturally admits a global cycle consistency loss that
verifies correspondences. For evaluation, we consider the tasks of fine-grained
action classification, few-shot learning, and video synchronization. We report
significant performance increases over previous methods. In addition, we report
two applications of our temporal alignment framework, namely 3D pose
reconstruction and fine-grained audio/visual retrieval.
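To make the abstract's three ingredients concrete, below is a minimal NumPy sketch: pairwise distances are cast as contrastive (softmax) probabilities, the hard DTW recursion is smoothed with a log-sum-exp so the local routing decisions become differentiable, and a per-frame soft cycle check stands in for the global cycle-consistency term. All names, the squared-Euclidean distance, and the temperature/gamma hyperparameters are illustrative assumptions; this is not the authors' implementation, whose exact recursion and DTW-path-based cycle loss differ.

```python
import numpy as np
from scipy.special import logsumexp

def match_log_probs(x, y, temperature=0.1):
    """Cast pairwise embedding distances as contrastive log-probabilities.

    Each frame of x receives a softmax distribution over candidate matches
    in y, so matches compete against each other (feature (ii) above).
    """
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # (N, M) squared distances
    logits = -d / temperature
    return logits - logsumexp(logits, axis=1, keepdims=True)

def soft_dtw_score(log_p, gamma=0.1):
    """Smoothed DTW alignment score over log match probabilities.

    The hard choice among the three admissible path moves is replaced by a
    log-sum-exp soft-max, making the local routing decision contrastive and
    differentiable (feature (i)).
    """
    n, m = log_p.shape
    r = np.full((n + 1, m + 1), -np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            r[i, j] = log_p[i - 1, j - 1] + gamma * logsumexp(prev / gamma)
    return r[n, m]

def cycle_consistency_loss(x, y, temperature=0.1):
    """Soft cycle check x -> y -> x, a simple stand-in for feature (iii).

    Each frame of x is softly matched into y and back; frames whose
    round trip does not land near their own index are penalised.
    """
    p_xy = np.exp(match_log_probs(x, y, temperature))  # (N, M)
    p_yx = np.exp(match_log_probs(y, x, temperature))  # (M, N)
    p_cycle = p_xy @ p_yx                              # (N, N) round-trip probabilities
    idx = np.arange(x.shape[0])
    expected_return = p_cycle @ idx                    # soft landing index per frame
    return float(np.mean((expected_return - idx) ** 2))

# Illustrative usage on random stand-ins for frame embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))   # 8 frames, 16-D embeddings
b = rng.normal(size=(10, 16))  # 10 frames of the same hypothetical process
loss = -soft_dtw_score(match_log_probs(a, b)) + cycle_consistency_loss(a, b)
print(f"alignment + cycle loss: {loss:.3f}")
```

During training, a loss of this shape would be minimised over pairs of sequences of the same process, driving the embedding network toward frame features whose optimal alignment scores highly and whose correspondences survive the round trip.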
Related papers
- Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences.
We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss, capturing local temporal dependencies, with a contrastive objective.
We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z)
- Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
- Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks [50.2033914945157]
We present a representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data.
The framework consists of three stages: estimating pairwise intensity functions, learning a projection which minimises a notion of intensity reconstruction error, and constructing evolving node representations via the learned projection.
Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses.
arXiv Detail & Related papers (2023-06-09T15:38:25Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks [11.66009967197084]
We propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in parallel.
We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
arXiv Detail & Related papers (2023-01-27T02:12:08Z)
- Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition [38.27785891922479]
We propose a few-shot learning pipeline for 3D skeleton-based action recognition based on Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE).
arXiv Detail & Related papers (2022-10-30T11:46:38Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [53.850686395708905]
Event-based cameras record an asynchronous stream of per-pixel brightness changes.
In this paper, we focus on single-layer architectures for representation learning from event data.
We show improvements of up to 9% in recognition accuracy compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-09-23T10:40:03Z)
- Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading [96.48553941812366]
Lip-reading aims to infer the speech content from the lip movement sequence.
The traditional learning process of seq2seq models suffers from two problems.
We propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems.
arXiv Detail & Related papers (2020-03-09T09:12:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.