Recurrence without Recurrence: Stable Video Landmark Detection with Deep
Equilibrium Models
- URL: http://arxiv.org/abs/2304.00600v1
- Date: Sun, 2 Apr 2023 19:08:02 GMT
- Title: Recurrence without Recurrence: Stable Video Landmark Detection with Deep
Equilibrium Models
- Authors: Paul Micaelli, Arash Vahdat, Hongxu Yin, Jan Kautz, Pavlo Molchanov
- Abstract summary: We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
- Score: 96.76758318732308
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cascaded computation, whereby predictions are recurrently refined over
several stages, has been a persistent theme throughout the development of
landmark detection models. In this work, we show that the recently proposed
Deep Equilibrium Model (DEQ) can be naturally adapted to this form of
computation. Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on
the challenging WFLW facial landmark dataset, reaching $3.92$ NME with fewer
parameters and a training memory cost of $\mathcal{O}(1)$ in the number of
recurrent modules. Furthermore, we show that DEQs are particularly suited for
landmark detection in videos. In this setting, it is typical to train on still
images due to the lack of labelled videos. This can lead to a ``flickering''
effect at inference time on video, whereby a model can rapidly oscillate
between different plausible solutions across consecutive frames. By rephrasing
DEQs as a constrained optimization, we emulate recurrence at inference time,
despite not having access to temporal data at training time. This Recurrence
without Recurrence (RwR) paradigm helps in reducing landmark flicker, which we
demonstrate by introducing a new metric, normalized mean flicker (NMF), and
contributing a new facial landmark video dataset (WFLW-V) targeting landmark
uncertainty. On the WFLW-V hard subset made up of $500$ videos, our LDEQ with
RwR improves the NME and NMF by $10\%$ and $13\%$, respectively, compared to the
strongest previously published model using a hand-tuned conventional filter.
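The core mechanics can be pictured with a short sketch. Below is a generic DEQ layer that solves for a fixed point $z^* = f(z^*, x)$ by iteration, plus a video loop that warm-starts each frame's solver from the previous frame's equilibrium. This is a simplified stand-in for the paper's constrained-optimization formulation of RwR; the layer, shapes, and solver settings are illustrative assumptions, not the actual LDEQ architecture.

```python
# Hypothetical sketch of a DEQ layer plus warm-started video inference.
# Not the authors' LDEQ: a real model would use implicit differentiation
# at the equilibrium for O(1) training memory and a stronger solver.
import torch
import torch.nn as nn

class DEQLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, x, z0=None, max_iters=30, tol=1e-4):
        # Solve z* = f(z*, x) by plain fixed-point iteration.
        z = torch.zeros_like(x) if z0 is None else z0
        for _ in range(max_iters):
            z_next = self.f(torch.cat([z, x], dim=-1))
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                return z_next
            z = z_next
        return z

@torch.no_grad()
def track_video(layer, frame_features):
    # Recurrence without Recurrence, simplified: initialize each frame's
    # solver at the previous frame's equilibrium, so consecutive frames
    # settle on nearby solutions and landmark flicker is damped.
    z, outputs = None, []
    for x in frame_features:
        z = layer(x, z0=z)
        outputs.append(z)
    return outputs
```

For example, `track_video(DEQLayer(64), [torch.randn(1, 64) for _ in range(10)])` runs the warm-started solve over ten frames' worth of features.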
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
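Distributing features evenly across clusters is the kind of constraint the Sinkhorn-Knopp algorithm enforces; a minimal sketch of that normalization step follows, with the shapes, iteration count, and temperature as illustrative assumptions rather than SIGMA's actual settings.

```python
# Hypothetical sketch of Sinkhorn-Knopp normalization: turn feature-to-
# cluster similarity logits into soft assignments whose total cluster
# usage is (approximately) uniform. Values are illustrative, not SIGMA's.
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    # scores: (n_features, n_clusters) similarity logits.
    q = torch.exp(scores / eps)
    q = q / q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True) / k  # equalize mass per cluster
        q = q / q.sum(dim=1, keepdim=True) / n  # normalize per feature
    return q * n  # each row is a soft assignment summing to 1
```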
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Building on this block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the quantized temporal features with their full-precision counterparts.
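One way to picture finite set calibration: a diffusion model only ever sees a finite set of discrete time steps, so a quantizer for the temporal features can be calibrated per step. The sketch below conveys that general idea with a generic symmetric uniform quantizer; `time_embedding`, the bit-width, and the per-step scale rule are illustrative assumptions, not TFMQ's actual procedure.

```python
# Hypothetical sketch: calibrate one quantization scale per discrete
# diffusion time step, exploiting the fact that the set of steps is
# finite. Generic uniform quantization, not TFMQ's method.
import torch
import torch.nn as nn

def calibrate_scales(time_embedding: nn.Embedding, num_steps: int, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1
    with torch.no_grad():
        scales = [time_embedding(torch.tensor([t])).abs().max() / qmax
                  for t in range(num_steps)]
    return torch.stack(scales)  # one scale per time step t

def fake_quantize(x, scale, n_bits: int = 8):
    # Round to the integer grid at the given scale, then dequantize.
    qmax = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
```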
arXiv Detail & Related papers (2023-11-27T12:59:52Z)
- Progressive Fourier Neural Representation for Sequential Video Compilation [75.43041679717376]
Motivated by continual learning, this work investigates how to accumulate and transfer neural implicit representations for multiple complex video data over sequential encoding sessions.
We propose a novel method, Progressive Fourier Neural Representation (PFNR), that aims to find an adaptive and compact sub-module in Fourier space to encode videos in each training session.
We validate our PFNR method on the UVG8/17 and DAVIS50 video sequence benchmarks and achieve impressive performance gains over strong continual learning baselines.
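In its simplest form, a Fourier-space neural video representation maps a normalized frame index through sinusoidal features to pixels; the sketch below shows that baseline, while PFNR's key contribution (finding a compact sub-module per encoding session) is not modeled. Network widths and frequencies are illustrative assumptions.

```python
# Hypothetical sketch of a Fourier-feature video representation: a
# normalized frame index t in [0, 1] is lifted to sinusoidal features
# and decoded to a (tiny) frame. Sizes are illustrative, not PFNR's.
import torch
import torch.nn as nn

class FourierVideoRep(nn.Module):
    def __init__(self, n_freqs=16, hidden=256, out_pixels=32 * 32 * 3):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, out_pixels), nn.Sigmoid(),
        )

    def forward(self, t):
        # t: (batch, 1) frame index scaled to [0, 1].
        phase = 2 * torch.pi * t * self.freqs
        return self.mlp(torch.cat([phase.sin(), phase.cos()], dim=-1))
```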
arXiv Detail & Related papers (2023-06-20T06:02:19Z)
- Making Reconstruction-based Method Great Again for Video Anomaly Detection [64.19326819088563]
Anomaly detection in videos is a significant yet challenging problem.
Existing reconstruction-based methods rely on old-fashioned convolutional autoencoders.
We propose a new autoencoder model for enhanced consecutive frame reconstruction.
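Whatever the autoencoder's internals, the underlying recipe is to score each frame by how poorly it is reconstructed; a minimal sketch follows, with a deliberately tiny autoencoder standing in for the paper's architecture.

```python
# Hypothetical sketch of reconstruction-based anomaly scoring: frames the
# autoencoder reconstructs poorly receive high anomaly scores. The tiny
# model is illustrative, not the paper's enhanced architecture.
import torch
import torch.nn as nn

class FrameAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def anomaly_scores(model, frames):
    # frames: (T, 3, H, W); per-frame mean squared reconstruction error.
    recon = model(frames)
    return ((frames - recon) ** 2).mean(dim=(1, 2, 3))
```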
arXiv Detail & Related papers (2023-01-28T01:57:57Z)
- Representation Recycling for Streaming Video Analysis [19.068248496174903]
StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
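This mirrors the warm-starting idea sketched for LDEQ above: rather than solving each frame's equilibrium from scratch, the solver runs only a few iterations per frame from the previous frame's representation. A minimal sketch under that reading, with `f` as a stand-in equilibrium update:

```python
# Hypothetical sketch of streaming DEQ inference. `f(z, x)` stands in
# for a learned equilibrium update; z is assumed to share x's shape.
import torch

@torch.no_grad()
def stream_deq(f, frame_features, iters_per_frame=2):
    z = torch.zeros_like(frame_features[0])
    reps = []
    for x in frame_features:
        for _ in range(iters_per_frame):
            z = f(z, x)  # a few updates suffice once z is warm
        reps.append(z.clone())
    return reps
```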
arXiv Detail & Related papers (2022-04-28T13:35:14Z)
- A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction [45.6432265855424]
We introduce Neural Uncertainty Quantifier (NUQ) - a principled quantification of the model's predictive uncertainty.
Our proposed framework trains more effectively compared to the state-of-the-art models.
arXiv Detail & Related papers (2021-10-06T00:25:22Z)
- VAE^2: Preventing Posterior Collapse of Variational Video Predictions in the Wild [131.58069944312248]
We propose a novel VAE structure, dubbed VAE-in-VAE or VAE$^2$.
We treat part of the observed video sequence as a random transition state that bridges its past and future, and maximize the likelihood of a Markov Chain over the video sequence under all possible transition states.
VAE$^2$ can mitigate the posterior collapse problem to a large extent, as it breaks the direct dependence between future and observation and does not directly regress the determinate future provided by the training data.
arXiv Detail & Related papers (2021-01-28T15:06:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.