Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
- URL: http://arxiv.org/abs/2404.16828v3
- Date: Tue, 13 Aug 2024 03:41:48 GMT
- Title: Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
- Authors: Charig Yang, Weidi Xie, Andrew Zisserman
- Abstract summary: We exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal.
We introduce a transformer-based model for ordering of image sequences of arbitrary length with built-in attribution maps.
- Score: 89.0660110757949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a transformer-based model for ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple domains covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state-of-the-art on standard benchmarks for image ordering.
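To make the proxy task concrete, below is a minimal sketch in PyTorch. It is an illustrative toy architecture, not the authors' exact model: frames are embedded without positional encodings (order is precisely what must be predicted), a transformer encoder contextualizes them, and a per-frame head classifies each frame's original position, with the shuffle permutation supplying labels for free.

```python
# Hedged sketch of ordering-as-supervision (illustrative, not the paper's model).
import torch
import torch.nn as nn

class OrderingTransformer(nn.Module):
    def __init__(self, d_model=128, max_len=16):
        super().__init__()
        # Toy per-frame encoder; a CNN/ViT backbone would be used in practice.
        self.embed = nn.Sequential(nn.Flatten(1), nn.LazyLinear(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, max_len)   # logits over positions

    def forward(self, frames):                    # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        tokens = self.embed(frames.flatten(0, 1)).view(B, T, -1)
        # No positional encoding on the inputs: order is what we predict.
        return self.head(self.transformer(tokens))   # (B, T, max_len)

model = OrderingTransformer()
frames = torch.randn(2, 8, 3, 32, 32)             # toy sequences, in order
perm = torch.stack([torch.randperm(8) for _ in range(2)])
idx = perm[..., None, None, None].expand_as(frames)
shuffled = torch.gather(frames, 1, idx)           # shuffle along time
# 'Time' is the free supervisory signal: each shuffled frame's label is
# its original index in the sequence.
logits = model(shuffled)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), perm.flatten())
loss.backward()
```

Only changes that are monotonic with time can drive this loss to zero, which is why training on it discovers monotonic changes while ignoring cyclic and stochastic ones.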
Related papers
- JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition [21.039399444257807]
Visual Place Recognition aims at recognizing previously visited places by relying on visual cues, and it is used in robotics applications such as SLAM and localization.
We propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework.
We show that our model outperforms the previous state of the art while being faster, using 8-times-smaller descriptors, having a lighter architecture, and processing sequences of various lengths.
arXiv Detail & Related papers (2024-03-28T19:11:26Z)
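As a rough illustration of JIST's joint-training idea, the sketch below shares one backbone between an image-level loss and a sequence-level loss. The classification heads, the mean-pooled sequence descriptor, and the weighting factor `lam` are placeholders for the paper's actual retrieval losses.

```python
# Hedged sketch of multi-task joint image + sequence training (placeholder losses).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(1), nn.LazyLinear(256), nn.ReLU())
image_head = nn.Linear(256, 10)              # placeholder image-level task
seq_head = nn.Linear(256, 10)                # placeholder sequence-level task
lam = 0.5                                    # assumed task-balancing weight

images, img_y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
seqs, seq_y = torch.randn(4, 5, 3, 32, 32), torch.randint(0, 10, (4,))

img_loss = nn.functional.cross_entropy(image_head(backbone(images)), img_y)
# Sequence descriptor: mean-pool per-frame features (one simple choice).
feats = backbone(seqs.flatten(0, 1)).view(4, 5, -1).mean(dim=1)
seq_loss = nn.functional.cross_entropy(seq_head(feats), seq_y)
(img_loss + lam * seq_loss).backward()       # single joint update
```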
- Self-Supervised Temporal Analysis of Spatiotemporal Data [2.2720298829059966]
There exists a correlation between the temporal patterns of geospatial activity and the type of land use.
A novel self-supervised approach is proposed to stratify landscapes based on mobility activity time series.
Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks.
arXiv Detail & Related papers (2023-04-25T20:34:38Z)
- Uniform Sequence Better: Time Interval Aware Data Augmentation for Sequential Recommendation [16.00020821220671]
Sequential recommendation is the task of predicting the next item to access based on a sequence of items.
Most existing works learn user preference as the transition pattern from the previous item to the next one, ignoring the time interval between these two items.
We propose to augment sequence data from the perspective of time intervals, an angle that has not been studied in the literature.
arXiv Detail & Related papers (2022-12-16T03:13:43Z)
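One way to make "augmenting from the perspective of time intervals" concrete is a crop operator that keeps the contiguous window whose inter-item gaps are most uniform, so the model trains on evenly spaced interactions. This specific operator is an illustration, not necessarily one of the paper's.

```python
# Hedged sketch of a time-interval-aware crop (illustrative operator).
from statistics import pvariance

def most_uniform_window(items, times, k):
    """Return the length-k contiguous slice with the most uniform time gaps."""
    assert len(items) == len(times) >= k >= 2
    best, best_var = 0, float("inf")
    for i in range(len(items) - k + 1):
        gaps = [times[j + 1] - times[j] for j in range(i, i + k - 1)]
        var = pvariance(gaps)
        if var < best_var:
            best, best_var = i, var
    return items[best:best + k], times[best:best + k]

items = ["a", "b", "c", "d", "e", "f"]
times = [0, 10, 21, 29, 100, 111]           # one large, irregular gap
print(most_uniform_window(items, times, 4))  # picks the evenly spaced run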
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
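A minimal sketch of the relative-location pretext task, assuming a grid of patches and a pooled reference representation; the paper's masking of reference features is what controls difficulty, but it is omitted here for brevity, and the architecture is illustrative.

```python
# Hedged sketch of relative-location prediction (masking omitted for brevity).
import torch
import torch.nn as nn

GRID = 4                                   # 4x4 grid of patches
embed = nn.Sequential(nn.Flatten(1), nn.LazyLinear(64))
classify = nn.Linear(64 * 2, GRID * GRID)  # (query, pooled reference) -> cell

img = torch.randn(3, 64, 64)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)        # (3, 4, 4, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(16, 3, 16, 16)

target = torch.randint(0, GRID * GRID, (1,))             # sample a query cell
query = embed(patches[target])                           # (1, 64)
reference = embed(patches).mean(0, keepdim=True)         # pooled references
logits = classify(torch.cat([query, reference], dim=1))
loss = nn.functional.cross_entropy(logits, target)       # which cell was it?
loss.backward()
```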
- DisPositioNet: Disentangled Pose and Identity in Semantic Image Manipulation [83.51882381294357]
DisPositioNet is a model that learns a disentangled representation for each object for the task of image manipulation using scene graphs.
Our framework enables the disentanglement of the variational latent embeddings as well as the feature representation in the graph.
arXiv Detail & Related papers (2022-11-10T11:47:37Z)
- A Generalist Framework for Panoptic Segmentation of Images and Videos [61.61453194912186]
We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task.
A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function.
Our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically.
arXiv Detail & Related papers (2022-10-12T16:18:25Z)
- Disentangling Random and Cyclic Effects in Time-Lapse Sequences [32.91054260622378]
We introduce the problem of disentangling time-lapse sequences in a way that allows separate, after-the-fact control of overall trends, cyclic effects, and random effects in the images.
Our approach is based on Generative Adversarial Networks (GANs) conditioned on the time coordinate of the time-lapse sequence.
We show that our models are robust to defects in the training data, enabling us to amend some of the practical difficulties in capturing long time-lapse sequences.
arXiv Detail & Related papers (2022-07-04T13:49:04Z)
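A hedged sketch of conditioning a generator on the time coordinate: raw t can carry the overall trend while sin/cos features of t (with an assumed period, e.g. one day) carry cyclic structure, leaving the noise vector for random effects. Layer sizes and the exact conditioning scheme below are illustrative, not the paper's.

```python
# Hedged sketch of a time-conditioned generator (illustrative conditioning).
import math
import torch
import torch.nn as nn

class TimeConditionedGenerator(nn.Module):
    def __init__(self, z_dim=32, period=24.0):
        super().__init__()
        self.period = period
        self.net = nn.Sequential(
            nn.Linear(z_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3 * 16 * 16), nn.Tanh(),
        )

    def forward(self, z, t):                 # z: (B, z_dim), t: (B,)
        phase = 2 * math.pi * t / self.period
        cond = torch.stack([t, torch.sin(phase), torch.cos(phase)], dim=1)
        x = self.net(torch.cat([z, cond], dim=1))
        return x.view(-1, 3, 16, 16)

gen = TimeConditionedGenerator()
imgs = gen(torch.randn(4, 32), torch.linspace(0.0, 48.0, 4))
# After training, sweeping t with z fixed varies trend + cycle only,
# while resampling z with t fixed varies only the random effects.
print(imgs.shape)    # torch.Size([4, 3, 16, 16])
```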
- Learning to Align Sequential Actions in the Wild [123.62879270881807]
We propose an approach to align sequential actions in the wild that exhibit diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
arXiv Detail & Related papers (2021-11-17T18:55:36Z)
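For context on what "monotonic vs. non-monotonic" means for alignment: plain dynamic time warping (sketched below) can only produce monotonic alignments, so handling non-monotonic sequences, as this paper does, requires going beyond it.

```python
# Hedged sketch: classic DTW as the monotonic-alignment baseline.
import numpy as np

def dtw(a, b):
    """Monotonic alignment cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of a
                                 cost[i, j - 1],      # skip a frame of b
                                 cost[i - 1, j - 1])  # match frames
    return cost[n, m]

a = np.random.randn(8, 16)          # 8 frames, 16-dim features
b = np.concatenate([a[:4], a[3:]])  # a stretched copy aligns cheaply
print(dtw(a, b) < dtw(a, np.random.randn(9, 16)))   # True (almost surely)
```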
- A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection [93.38607559281601]
We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image.
The anomaly score is obtained by aggregating the patch-based votes of the correct transformation across scales and image regions.
arXiv Detail & Related papers (2021-04-29T17:49:48Z)
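A toy sketch of the transformation-discrimination scoring idea: a classifier trained on normal patches predicts which transformation was applied, and low aggregate confidence in the true transformation, voted across patches, flags an anomaly. The rotation set and linear classifier below are stand-ins for the paper's hierarchical generative model, and the cross-scale aggregation is omitted.

```python
# Hedged sketch of transformation-discriminating anomaly scoring (toy stand-ins).
import torch
import torch.nn as nn

TRANSFORMS = [lambda x: x,
              lambda x: torch.rot90(x, 1, (-2, -1)),
              lambda x: torch.rot90(x, 2, (-2, -1)),
              lambda x: torch.rot90(x, 3, (-2, -1))]

classifier = nn.Sequential(nn.Flatten(1), nn.LazyLinear(len(TRANSFORMS)))

def anomaly_score(patches):
    """Higher = more anomalous. patches: (N, C, H, W). Assumes the
    classifier was trained on normal data to identify the transform."""
    votes = []
    for k, t in enumerate(TRANSFORMS):
        probs = classifier(t(patches)).softmax(dim=1)   # (N, n_transforms)
        votes.append(probs[:, k])    # confidence in the correct transform
    # Aggregate patch-level votes; the paper also aggregates across scales.
    return -torch.stack(votes).mean().item()

patches = torch.rand(8, 3, 16, 16)
print(anomaly_score(patches))
```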