Drop-DTW: Aligning Common Signal Between Sequences While Dropping
Outliers
- URL: http://arxiv.org/abs/2108.11996v1
- Date: Thu, 26 Aug 2021 18:52:35 GMT
- Title: Drop-DTW: Aligning Common Signal Between Sequences While Dropping
Outliers
- Authors: Nikita Dvornik and Isma Hadji and Konstantinos G. Derpanis and Animesh
Garg and Allan D. Jepson
- Abstract summary: We introduce Drop-DTW, a novel algorithm that aligns the common signal between the sequences while automatically dropping the outlier elements from the matching.
In our experiments, we show that Drop-DTW is a robust similarity measure for sequence retrieval and demonstrate its effectiveness as a training loss on diverse applications.
- Score: 33.174893836302005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we consider the problem of sequence-to-sequence alignment for
signals containing outliers. Assuming the absence of outliers, the standard
Dynamic Time Warping (DTW) algorithm efficiently computes the optimal alignment
between two (generally) variable-length sequences. While DTW is robust to
temporal shifts and dilations of the signal, it fails to align sequences in a
meaningful way in the presence of outliers that can be arbitrarily interspersed
in the sequences. To address this problem, we introduce Drop-DTW, a novel
algorithm that aligns the common signal between the sequences while
automatically dropping the outlier elements from the matching. The entire
procedure is implemented as a single dynamic program that is efficient and
fully differentiable. In our experiments, we show that Drop-DTW is a robust
similarity measure for sequence retrieval and demonstrate its effectiveness as
a training loss on diverse applications. With Drop-DTW, we address temporal
step localization on instructional videos, representation learning from noisy
videos, and cross-modal representation learning for audio-visual retrieval and
localization. In all applications, we take a weakly- or unsupervised approach
and demonstrate state-of-the-art results under these settings.
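The dropping mechanism described in the abstract can be illustrated as a small dynamic program. The sketch below augments the standard DTW recursion with per-element drop costs for both sequences, so an element is skipped whenever dropping it is cheaper than matching it. This is a simplified, hypothetical reconstruction for illustration only, not the authors' implementation (the paper's full formulation and its differentiable relaxation involve additional coupled DP tables); the names `drop_dtw`, `cost`, `drop_x`, and `drop_z` are assumptions of this sketch.

```python
import numpy as np

def drop_dtw(cost, drop_x, drop_z):
    """Simplified Drop-DTW-style alignment cost.

    cost   : (n, m) array of pairwise match costs between sequences x and z
    drop_x : (n,) per-element cost of dropping elements of x
    drop_z : (m,) per-element cost of dropping elements of z
    Returns the minimal total cost of aligning the common signal,
    where either sequence may drop (skip) elements at its drop cost.
    """
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    # A prefix facing an empty sequence can only be dropped entirely.
    D[1:, 0] = np.cumsum(drop_x)
    D[0, 1:] = np.cumsum(drop_z)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard DTW step: match x[i-1] with z[j-1].
            match = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                             D[i - 1, j],
                                             D[i, j - 1])
            # Drop steps: skip an element of either sequence at its drop cost.
            D[i, j] = min(match,
                          drop_x[i - 1] + D[i - 1, j],
                          drop_z[j - 1] + D[i, j - 1])
    return D[n, m]
```

For example, aligning x = [1, 2, 9, 3] with z = [1, 2, 3] under unit drop costs skips the outlier 9 (drop cost 1) instead of matching it (match cost at least 6), whereas plain DTW would be forced to absorb the outlier into the alignment.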
Related papers
- Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences.
We introduce the novel Local-Alignment Contrastive (LAC) loss, which incorporates a differentiable local alignment loss to capture local temporal dependencies.
We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z) - Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - TheGlueNote: Learned Representations for Robust and Flexible Note Alignment [3.997809845676912]
We show how a transformer encoder network, TheGlueNote, predicts pairwise note similarities for two 512-note subsequences.
Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.
arXiv Detail & Related papers (2024-08-08T08:42:30Z) - Deep Declarative Dynamic Time Warping for End-to-End Learning of
Alignment Paths [54.53208538517505]
This paper addresses learning end-to-end models for time series data that include a temporal alignment step via dynamic time warping (DTW).
We propose a DTW layer based around bi-level optimisation and deep declarative networks, which we name DecDTW.
We show that this property is particularly useful for applications where downstream loss functions are defined on the optimal alignment path itself.
arXiv Detail & Related papers (2023-03-19T21:58:37Z) - Approximating DTW with a convolutional neural network on EEG data [9.409281517596396]
We propose a fast and differentiable approximation of Dynamic Time Warping (DTW).
We show that our method achieves at least the same level of accuracy as other main DTW approximations, with higher computational efficiency.
arXiv Detail & Related papers (2023-01-30T13:27:47Z) - Scaling Multimodal Pre-Training via Cross-Modality Gradient
Harmonization [68.49738668084693]
Self-supervised pre-training has recently demonstrated success on large-scale multimodal data.
Cross-modality alignment (CMA) provides only weak and noisy supervision.
CMA might cause conflicts and biases among modalities.
arXiv Detail & Related papers (2022-11-03T18:12:32Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - Learning to Align Sequential Actions in the Wild [123.62879270881807]
We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
arXiv Detail & Related papers (2021-11-17T18:55:36Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Representation Learning via Global Temporal Alignment and
Cycle-Consistency [20.715813546383178]
We introduce a weakly supervised method for representation learning based on aligning temporal sequences.
We report significant performance increases over previous methods.
In addition, we report two applications of our temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
arXiv Detail & Related papers (2021-05-11T17:34:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.