Knowing When to Quit: Selective Cascaded Regression with Patch Attention
for Real-Time Face Alignment
- URL: http://arxiv.org/abs/2108.00377v2
- Date: Tue, 3 Aug 2021 07:21:08 GMT
- Title: Knowing When to Quit: Selective Cascaded Regression with Patch Attention
for Real-Time Face Alignment
- Authors: Gil Shapira, Noga Levy, Ishay Goldin, Roy J. Jevnisek
- Abstract summary: We show that frontal faces with neutral expressions converge faster than faces with extreme poses or expressions.
We offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained local patch attention module.
Our model runs in real-time on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming all state-of-the-art methods under 1000 MMA.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Facial landmark (FLM) estimation is a critical component of many
face-related applications. In this work, we aim to optimize for both accuracy
and speed and explore the trade-off between them. Our key observation is that
not all faces are created equal. Frontal faces with neutral expressions
converge faster than faces with extreme poses or expressions. To differentiate
among samples, we train our model to predict the regression error after each
iteration. If the current iteration is accurate enough, we stop iterating,
saving redundant iterations while keeping the accuracy in check. We also
observe that, because neighboring patches overlap, we can infer all FLMs from
only a small number of patches without a major sacrifice in accuracy.
Architecturally, we offer a multi-scale, patch-based, lightweight feature
extractor with a fine-grained local patch attention module, which computes a
patch weighting according to the information in the patch itself and enhances
the expressive power of the patch features. We analyze the patch attention data
to infer where the model is attending when regressing facial landmarks and
compare it to face attention in humans. Our model runs in real-time on a mobile
device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming all
state-of-the-art methods under 1000 MMA, with a normalized mean error of 8.16
on the challenging subset of the 300W dataset.
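To make the two mechanisms concrete, here is a minimal Python sketch of both ideas. The names `extract`, `steps`, `error_heads`, and the exit threshold `tau` are hypothetical stand-ins for the paper's feature extractor, stage regressors, and error predictors, not the authors' implementation.
```python
import numpy as np

def patch_attention(patch_feats, w):
    """Local patch attention (sketch): score each patch from its own content,
    then reweight the patch features by the resulting softmax weights."""
    scores = patch_feats @ w                        # (P,) one scalar per patch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return patch_feats * weights[:, None], weights  # enhanced features + weights

def selective_cascade(extract, steps, error_heads, image, lms, tau=0.05):
    """Selective cascaded regression (sketch): refine landmarks stage by stage
    and exit as soon as the predicted post-stage error drops below tau."""
    for step, err_head in zip(steps, error_heads):
        feats = extract(image, lms)   # multi-scale patch features around lms
        lms = lms + step(feats)       # one cascade iteration
        if err_head(feats, lms) < tau:
            break                     # easy face (frontal, neutral): stop early
    return lms
```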
Related papers
- Learning to Embed Time Series Patches Independently
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series.
We argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning.
We propose to use 1) a simple patch reconstruction task that autoencodes each patch without looking at other patches, and 2) a simple patch-wise MLP that embeds each patch independently.
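A minimal sketch of the patch-independent embedding idea; the patch length and embedding size are assumptions, and `W`, `b` stand in for the learned patch-wise weights:
```python
import numpy as np

def embed_patches_independently(series, patch_len, W, b):
    """Embed every patch with the same per-patch linear map; no patch sees
    any other patch, in contrast to attention across patches."""
    n = len(series) // patch_len
    patches = series[:n * patch_len].reshape(n, patch_len)  # (num_patches, patch_len)
    return patches @ W + b                                  # (num_patches, d_model)

rng = np.random.default_rng(0)
series = rng.standard_normal(512)
W, b = 0.1 * rng.standard_normal((16, 64)), np.zeros(64)    # patch_len=16, d_model=64
z = embed_patches_independently(series, 16, W, b)           # (32, 64)
```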
arXiv Detail & Related papers (2023-12-27T06:23:29Z)
- Bootstrap Masked Visual Modeling via Hard Patches Mining
Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations.
We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself.
To empower the model as a teacher, we propose Hard Patches Mining (HPM), predicting patch-wise losses and subsequently determining where to mask.
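A minimal sketch of that masking step: an auxiliary head predicts a per-patch loss, and the mask concentrates on the patches predicted to be hardest (the actual method's easy-to-hard schedule mixing random and hard masking is omitted here):
```python
import numpy as np

def hpm_mask(predicted_patch_losses, mask_ratio=0.75):
    """Mask the patches the auxiliary head predicts to be hardest to reconstruct."""
    n = predicted_patch_losses.shape[0]
    k = int(mask_ratio * n)
    hardest = np.argsort(predicted_patch_losses)[::-1][:k]  # highest predicted loss
    mask = np.zeros(n, dtype=bool)
    mask[hardest] = True                                    # True = masked out
    return mask
```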
arXiv Detail & Related papers (2023-12-21T10:27:52Z)
- PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction
We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images.
PF-LRM reconstructs the object and estimates the relative camera poses simultaneously, in about 1.3 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-11-20T18:57:55Z)
- Fixing Model Bugs with Natural Language Patches
We explore natural language patches that allow developers to provide corrective feedback at the right level of abstraction.
We show that with a small amount of synthetic data, we can teach models to effectively use real patches on real data.
We also show that finetuning on as many as 100 labeled examples may be needed to match the performance of a small set of language patches.
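As a rough illustration of the abstraction level (not the paper's actual gating-plus-interpreter architecture), a patch can be thought of as a condition paired with a corrective behavior:
```python
def apply_patches(model_predict, patches, example):
    """Each patch is (condition, override): if the condition fires on the input,
    the patch's corrective label overrides the model's prediction."""
    for condition, override in patches:
        if condition(example):
            return override
    return model_predict(example)

# Hypothetical patch: "if the review calls the food 'bomb', that means it is good"
patches = [(lambda review: "bomb" in review, "positive")]
```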
arXiv Detail & Related papers (2022-11-07T05:49:19Z)
- Accelerating Vision Transformer Training via a Patch Sampling Schedule
We introduce the notion of a Patch Sampling Schedule (PSS), which varies the number of Vision Transformer (ViT) patches used per batch during training.
We observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference.
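A minimal sketch of one possible schedule; the linear ramp and the 30% floor are illustrative choices, not the paper's exact schedules:
```python
import numpy as np

def pss_num_patches(step, total_steps, n_full=196, low=0.3):
    """Linearly anneal from low * n_full sampled ViT patches up to the full set."""
    frac = low + (1.0 - low) * step / max(1, total_steps - 1)
    return max(1, int(frac * n_full))

def sample_patches(step, total_steps, n_full=196, rng=np.random.default_rng(0)):
    k = pss_num_patches(step, total_steps, n_full)
    return rng.choice(n_full, size=k, replace=False)  # keep only these patch tokens
```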
arXiv Detail & Related papers (2022-08-19T19:16:46Z)
- Patching open-vocabulary models by interpolating weights
Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks.
We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate.
Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
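The core operation is a per-parameter interpolation between the zero-shot and fine-tuned weights; a minimal sketch with state dicts as plain dictionaries of arrays, where the mixing coefficient alpha would be chosen on held-out data:
```python
def patch_by_interpolation(theta_zeroshot, theta_finetuned, alpha=0.5):
    """theta_patched = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned,
    applied parameter tensor by parameter tensor."""
    return {name: (1 - alpha) * theta_zeroshot[name] + alpha * theta_finetuned[name]
            for name in theta_zeroshot}
```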
arXiv Detail & Related papers (2022-08-10T23:47:43Z)
- Subpixel Heatmap Regression for Facial Landmark Localization
Heatmap regression approaches suffer from discretization-induced errors related to both the heatmap encoding and decoding process.
We propose a new approach for the heatmap encoding and decoding process by leveraging the underlying continuous distribution.
Our approach offers noticeable gains across multiple datasets setting a new state-of-the-art result in facial landmark localization.
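One common way to decode a heatmap continuously is a softmax-weighted expectation around the argmax; this is a generic sketch of subpixel decoding, not the paper's specific distribution-based estimator:
```python
import numpy as np

def subpixel_decode(heatmap):
    """Argmax, then a softmax-weighted expectation over the 3x3 neighborhood,
    giving continuous (x, y) coordinates instead of integer pixels.
    Border handling is naive (clipped indices may repeat)."""
    h, w = heatmap.shape
    y0, x0 = np.unravel_index(np.argmax(heatmap), (h, w))
    ys = np.clip([y0 - 1, y0, y0 + 1], 0, h - 1)
    xs = np.clip([x0 - 1, x0, x0 + 1], 0, w - 1)
    win = heatmap[np.ix_(ys, xs)]
    p = np.exp(win - win.max())
    p /= p.sum()
    x = (p * np.asarray(xs)[None, :]).sum()   # expectation over columns
    y = (p * np.asarray(ys)[:, None]).sum()   # expectation over rows
    return float(x), float(y)
```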
arXiv Detail & Related papers (2021-11-03T17:21:28Z)
- Accurate, Interpretable, and Fast Animation: An Iterative, Sparse, and Nonconvex Approach
A face rig must be accurate and, at the same time, fast to compute.
One of the parameters in such common animation models is a sparsity regularization.
To reduce the complexity, the Majorization-Minimization (MM) paradigm is applied.
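For intuition, a minimal MM sketch on a generic sparse regression problem: the nonconvex penalty lam * sum(|x_i|^p) is majorized by a quadratic at the current iterate, so each step solves a ridge-like system in closed form (an IRLS-style scheme; the paper's actual rig objective differs):
```python
import numpy as np

def mm_sparse(A, b, p=0.5, lam=0.1, iters=20, eps=1e-6):
    """Each MM step minimizes ||Ax - b||^2 plus a quadratic majorizer of the
    nonconvex penalty lam * sum(|x_i|^p); the surrogate has a closed form."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        w = lam * p * (np.abs(x) + eps) ** (p - 2) / 2.0  # majorizer curvature
        x = np.linalg.solve(A.T @ A + np.diag(w), A.T @ b)
    return x
```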
arXiv Detail & Related papers (2021-09-17T05:42:07Z)
- Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches
We propose a novel framework called multi-patch generative adversarial nets (MPGAN), which synthesises local patch features and labels unseen classes with a novel weighted voting strategy.
MPGAN has significantly greater accuracy than state-of-the-art methods.
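A minimal sketch of the weighted-voting step, assuming each patch classifier emits class scores and carries an importance weight (the weighting scheme itself is the paper's contribution and is not reproduced here):
```python
import numpy as np

def weighted_vote(patch_logits, patch_weights):
    """Combine per-patch class scores with per-patch importance weights
    and return the winning class index."""
    scores = patch_weights @ patch_logits   # (P,) @ (P, C) -> (C,)
    return int(np.argmax(scores))
```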
arXiv Detail & Related papers (2020-07-27T05:49:44Z)
- Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild
We propose Pixel-in-Pixel Net (PIPNet) for facial landmark detection.
The proposed model is equipped with a novel detection head based on heatmap regression.
To further improve the cross-domain generalization capability of PIPNet, we propose self-training with curriculum.
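A minimal sketch of self-training with a curriculum; `model` is a hypothetical object with fit/predict/confidence methods, and the growing keep-fraction is an illustrative curriculum rather than PIPNet's exact recipe:
```python
def self_train_curriculum(model, labeled, unlabeled, rounds=3, final_keep=0.5):
    """Each round: retrain, pseudo-label the unlabeled pool, then keep a growing
    fraction of the most confident pseudo-labels for the next round."""
    for r in range(1, rounds + 1):
        model.fit(labeled)
        pseudo = [(model.confidence(x), x, model.predict(x)) for x in unlabeled]
        pseudo.sort(key=lambda t: t[0], reverse=True)      # most confident first
        k = int(final_keep * r / rounds * len(pseudo))
        labeled = labeled + [(x, y) for _, x, y in pseudo[:k]]
    return model
```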
arXiv Detail & Related papers (2020-03-08T12:23:42Z)