When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model
- URL: http://arxiv.org/abs/2105.13150v1
- Date: Thu, 27 May 2021 13:51:42 GMT
- Title: When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model
- Authors: Haibo Jin, Jinpeng Li, Shengcai Liao, Ling Shao
- Abstract summary: We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
- Score: 87.25037167380522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, significant progress has been made in the research of facial
landmark detection. However, few prior works have thoroughly discussed
models for practical applications. Instead, they often focus on improving a
couple of issues at a time while ignoring the others. To bridge this gap, we
aim to explore a practical model that is accurate, robust, efficient,
generalizable, and end-to-end trainable at the same time. To this end, we first
propose a baseline model equipped with one transformer decoder as detection
head. To achieve better accuracy, we further propose two lightweight
modules, namely dynamic query initialization (DQInit) and query-aware memory
(QAMem). Specifically, DQInit dynamically initializes the decoder queries
from the inputs, enabling the model to match the accuracy of models
with multiple decoder layers. QAMem is designed to enhance the discriminative
ability of queries on low-resolution feature maps by assigning separate memory
values to each query rather than a shared one. With the help of QAMem, our
model removes the dependence on high-resolution feature maps and is still able
to obtain superior accuracy. Extensive experiments and analysis on three
popular benchmarks show the effectiveness and practical advantages of the
proposed model. Notably, our model achieves new state of the art on WFLW as
well as competitive results on 300W and COFW, while still running at 50+ FPS.
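The QAMem idea in the abstract (separate memory values per query, so queries stay discriminative on low-resolution feature maps) can be illustrated with a minimal numpy sketch. This is an interpretation of the one-line description, not the paper's implementation; all shapes, weight matrices, and the per-query value projection below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, num_kv, dim = 4, 16, 8    # landmarks, feature-map cells, channels

Q = rng.normal(size=(num_queries, dim))  # one query per landmark
K = rng.normal(size=(num_kv, dim))       # keys from the feature map
X = rng.normal(size=(num_kv, dim))       # feature-map activations

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(Q @ K.T / np.sqrt(dim))   # (num_queries, num_kv)

# Shared memory: one value projection for all queries (standard cross-attention).
W_v_shared = rng.normal(size=(dim, dim))
out_shared = attn @ (X @ W_v_shared)

# Query-aware memory (sketch): each query owns its value projection, so two
# queries attending to the same coarse cell can still read different content.
W_v_per_query = rng.normal(size=(num_queries, dim, dim))
values_per_query = np.einsum('nd,qde->qne', X, W_v_per_query)  # (q, n, e)
out_qamem = np.einsum('qn,qne->qe', attn, values_per_query)    # (q, e)

assert out_shared.shape == out_qamem.shape == (num_queries, dim)
```

The shared-memory variant is ordinary attention; in the per-query variant, identical attention weights over a low-resolution map can still yield query-specific outputs, which is the property the abstract attributes to QAMem.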
Related papers
- Data-Driven Approaches for Modelling Target Behaviour [1.5495593104596401]
The performance of tracking algorithms depends on the chosen model assumptions regarding the target dynamics.
This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion.
arXiv Detail & Related papers (2024-10-14T14:18:27Z)
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z)
- Decoupled DETR For Few-shot Object Detection [4.520231308678286]
We improve the FSOD model to address the severe issue of sample imbalance and weak feature propagation.
We build a unified decoder module that dynamically fuses the decoder layers into the output feature.
Our results indicate that the proposed module achieves stable improvements of 5% to 10% in both fine-tuning and meta-learning paradigms.
arXiv Detail & Related papers (2023-11-20T07:10:39Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
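The matching stage of the two-stage pipeline above can be sketched as a dot-product correlation between the query feature and every cell of every frame, followed by an argmax per frame. This is a schematic under assumed shapes, not TAPIR's actual architecture, which adds a refinement stage over local correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, W, C = 5, 8, 8, 16                 # frames, feature-map size, channels (hypothetical)
feats = rng.normal(size=(T, H, W, C))    # per-frame feature maps

# Matching stage (sketch): compare the query feature against every location
# in every frame; the best-correlating cell is the candidate match.
qy, qx = 3, 4
query = feats[0, qy, qx]                 # feature at the queried point

scores = feats.reshape(T, H * W, C) @ query    # dot-product correlation
idx = scores.argmax(axis=1)                    # best cell per frame
track = np.stack([idx // W, idx % W], axis=1)  # (T, 2) candidate (y, x) per frame

assert track.shape == (T, 2)
```

The refinement stage would then update this coarse trajectory (and the query features) using correlations in a local window around each candidate, which is what lifts the accuracy reported on TAP-Vid.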
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders using Hierarchical Maps Distillation [16.04961815178485]
We propose a lightweight model that employs multiple simple heterogeneous decoders.
Our approach achieves saliency prediction accuracy on par or better than state-of-the-art methods.
arXiv Detail & Related papers (2023-01-11T18:20:19Z)
- Learning to Fit Morphable Models [12.469605679847085]
We build upon recent advances in learned optimization and propose an update rule inspired by the classic Levenberg-Marquardt algorithm.
We show the effectiveness of the proposed neural optimizer on the problems of 3D body surface estimation from a head-mounted device and face fitting from 2D landmarks.
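The classic Levenberg-Marquardt rule that inspires this learned update can be written out on a toy least-squares problem. The line-fitting setup below is hypothetical and the step is the textbook rule, not the paper's learned variant, which replaces parts of it with trained components.

```python
import numpy as np

# Classic Levenberg-Marquardt step: given residuals r(theta) and Jacobian J,
# solve (J^T J + lam * I) delta = -J^T r and update theta by delta.
def lm_step(residuals, jacobian, theta, lam):
    r = residuals(theta)
    J = jacobian(theta)
    A = J.T @ J + lam * np.eye(theta.size)   # damped Gauss-Newton system
    delta = np.linalg.solve(A, -J.T @ r)
    return theta + delta

# Hypothetical toy problem: fit y = a*x + b to noise-free data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0

residuals = lambda th: th[0] * x + th[1] - y
jacobian = lambda th: np.stack([x, np.ones_like(x)], axis=1)

theta = np.zeros(2)
for _ in range(10):
    theta = lm_step(residuals, jacobian, theta, lam=1e-3)

assert np.allclose(theta, [2.0, 1.0], atol=1e-6)  # recovers a=2, b=1
```

The damping term `lam * I` interpolates between gradient descent (large `lam`) and Gauss-Newton (small `lam`); a learned optimizer can, for example, predict the update direction or damping instead of using these fixed rules.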
arXiv Detail & Related papers (2021-11-29T18:59:53Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.