Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
- URL: http://arxiv.org/abs/2207.13820v1
- Date: Wed, 27 Jul 2022 22:54:09 GMT
- Title: Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
- Authors: Junhyeong Cho, Kim Youwang, Tae-Hyun Oh
- Abstract summary: Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction.
Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.
We propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO.
- Score: 17.22112222736234
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer encoder architectures have recently achieved state-of-the-art
results on monocular 3D human mesh reconstruction, but they require a
substantial number of parameters and expensive computations. Due to the large
memory overhead and slow inference speed, it is difficult to deploy such models
for practical use. In this paper, we propose a novel transformer
encoder-decoder architecture for 3D human mesh reconstruction from a single
image, called FastMETRO. We identify that the performance bottleneck in
encoder-based transformers is caused by a token design that introduces
high-complexity interactions among input tokens. We disentangle these
interactions via an encoder-decoder architecture, which allows our model to
require far fewer parameters and a shorter inference time. In addition, we
impose prior knowledge of the human body's morphological relationships via
attention masking and mesh upsampling operations, which leads to faster
convergence with higher
accuracy. Our FastMETRO improves the Pareto-front of accuracy and efficiency,
and clearly outperforms image-based methods on Human3.6M and 3DPW. Furthermore,
we validate its generalizability on FreiHAND.
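As a concrete illustration of the disentangled design described above, here is a minimal sketch (not the authors' released code; dimensions, layer counts, and the topology-mask construction are assumptions): image tokens are refined by an encoder on their own, while learnable joint and coarse-vertex queries interact with them only through decoder cross-attention, with self-attention among mesh tokens optionally masked by body topology.

```python
import torch
import torch.nn as nn

class FastMETROSketch(nn.Module):
    """Minimal sketch of a disentangled encoder-decoder for mesh recovery.

    Image tokens are refined by an encoder on their own; learnable joint and
    coarse-vertex queries touch them only through decoder cross-attention, so
    mesh tokens never inflate the encoder's self-attention.
    """

    def __init__(self, d=256, n_joints=14, n_coarse_verts=431, adjacency=None):
        super().__init__()
        n = n_joints + n_coarse_verts
        self.queries = nn.Parameter(torch.randn(n, d))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=3)
        # Morphological prior: block self-attention between non-adjacent
        # joints/vertices (True = masked). With no topology given, the sketch
        # falls back to full attention.
        if adjacency is not None:                  # (n, n) bool adjacency
            self.register_buffer("attn_mask", ~adjacency)
        else:
            self.attn_mask = None
        self.coord_head = nn.Linear(d, 3)          # one 3D coordinate per token

    def forward(self, img_tokens):                 # (B, N_img, d) backbone features
        memory = self.encoder(img_tokens)          # image tokens only
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        out = self.decoder(q, memory, tgt_mask=self.attn_mask)
        coarse = self.coord_head(out)              # (B, n_joints + 431, 3)
        # A precomputed upsampling matrix would then lift the coarse mesh to
        # the full-resolution mesh (the coarse-to-fine step in the abstract).
        return coarse
```

Because the mesh tokens never enter the encoder, the quadratic interactions the abstract identifies as the bottleneck are removed by construction.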
Related papers
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address the high computational cost of attention in this task.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
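One reading of the "SMPL-based target representation" is that the network regresses compact SMPL pose and shape parameters instead of thousands of per-vertex coordinates; the hypothetical head below illustrates that idea (the decoupled attention operation itself is not reproduced here).

```python
import torch
import torch.nn as nn

class SMPLParamHead(nn.Module):
    """Hypothetical head mapping transformer features to SMPL parameters.

    SMPL describes a body with 72 pose values (24 joints x 3 axis-angle)
    and 10 shape coefficients, so the regression target is far smaller
    than a per-vertex mesh.
    """

    def __init__(self, d=512, n_pose=72, n_shape=10):
        super().__init__()
        self.pose = nn.Linear(d, n_pose)
        self.shape = nn.Linear(d, n_shape)

    def forward(self, feat):                       # (B, d) pooled feature
        return self.pose(feat), self.shape(feat)   # theta, beta

# An SMPL layer (e.g., from the smplx package) would then turn
# (theta, beta) into the 6890-vertex body mesh.
head = SMPLParamHead()
theta, beta = head(torch.randn(2, 512))
print(theta.shape, beta.shape)  # torch.Size([2, 72]) torch.Size([2, 10])
```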
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy that selects the number of encoder layers to run, conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
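A minimal sketch of input-conditioned encoder depth (the gating design and names below are assumptions, not the paper's exact mechanism): a small predictor chooses how many encoder layers to run for each input.

```python
import torch
import torch.nn as nn

class DepthGatedEncoder(nn.Module):
    """Encoder that picks how many of its layers to run, per input."""

    def __init__(self, d=256, max_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
            for _ in range(max_layers))
        # Tiny predictor: pooled input features -> choice of exit depth.
        self.depth_head = nn.Linear(d, max_layers)

    def forward(self, tokens):                     # (1, N, d); batch of 1 keeps
        pooled = tokens.mean(dim=1)                # the per-input depth simple
        depth = int(self.depth_head(pooled).argmax(-1).item()) + 1
        for layer in self.layers[:depth]:          # remaining layers are skipped,
            tokens = layer(tokens)                 # saving encoder compute
        return tokens

enc = DepthGatedEncoder()
out = enc(torch.randn(1, 100, 256))                # depth chosen by the input itself
```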
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation [0.13654846342364302]
We present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features.
SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features.
We benchmark SegFormer3D against the current SOTA models on three widely used datasets.
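A minimal sketch of an all-MLP decoder over multiscale volumetric features, in the SegFormer spirit (channel sizes and names are illustrative assumptions): each scale is linearly projected, resampled to a common resolution, concatenated, and fused by another linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder3D(nn.Module):
    """Fuses multiscale 3D features using only per-voxel linear layers."""

    def __init__(self, in_dims=(32, 64, 160, 256), d=256, n_classes=4):
        super().__init__()
        # 1x1x1 convolutions act as per-voxel MLPs (the "all-MLP" part).
        self.proj = nn.ModuleList(nn.Conv3d(c, d, 1) for c in in_dims)
        self.fuse = nn.Conv3d(d * len(in_dims), d, 1)
        self.classify = nn.Conv3d(d, n_classes, 1)

    def forward(self, feats):      # list of (B, C_i, D_i, H_i, W_i), finest first
        size = feats[0].shape[2:]  # resample every scale to the finest grid
        ups = [F.interpolate(p(f), size=size, mode="trilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))
```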
arXiv Detail & Related papers (2024-04-15T22:12:05Z)
- Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers [37.14235383028582]
We introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference.
Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation.
arXiv Detail & Related papers (2023-12-14T17:18:34Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging [7.601695814245209]
We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured by a single measurement.
By combining optimization algorithms with neural networks, deep unfolding networks (DUNs) have achieved strong results on inverse problems.
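A minimal sketch of a deep unfolding iteration for a linear inverse problem y = Ax (the learned prior here is a placeholder; the paper's convolution-transformer mixture prior and uncertainty estimation are not reproduced): each unrolled stage alternates a data-fidelity gradient step with a learned prior step.

```python
import torch
import torch.nn as nn

class UnfoldingNet(nn.Module):
    """K unrolled stages: gradient step on 0.5*||Ax - y||^2, then a learned prior."""

    def __init__(self, n=64, k_stages=5):
        super().__init__()
        self.steps = nn.Parameter(torch.full((k_stages,), 0.1))  # learned step sizes
        self.priors = nn.ModuleList(            # stand-ins for the paper's
            nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n))
            for _ in range(k_stages))           # convolution-transformer prior

    def forward(self, y, A):                    # y: (B, m) measurements, A: (m, n)
        x = y @ A                               # crude initialization, ~ A^T y
        for step, prior in zip(self.steps, self.priors):
            grad = (x @ A.T - y) @ A            # gradient of the data-fidelity term
            x = prior(x - step * grad)          # optimization step, then prior step
        return x
```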
arXiv Detail & Related papers (2023-06-20T06:25:48Z)
- TriPlaneNet: An Encoder for EG3D Inversion [1.9567015559455132]
NeRF-based GANs have introduced a number of approaches for high-resolution and high-fidelity generative modeling of human heads.
Despite the success of universal optimization-based methods for 2D GAN inversion, those applied to 3D GANs may fail to extrapolate the result to novel views.
We introduce a fast technique that bridges the gap between the two approaches by directly utilizing the tri-plane representation presented for the EG3D generative model.
arXiv Detail & Related papers (2023-03-23T17:56:20Z)
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present an analysis of the gate probabilities and their input dependency, which helps us better understand deep encoders.
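A minimal sketch of an input-dependent layer gate in the spirit of this summary (the gate parameterization is an assumption): a per-layer probability, computed from the input itself, scales the layer's contribution and allows skipping it at inference.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """Transformer layer wrapped with an input-dependent gate."""

    def __init__(self, d=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.gate = nn.Linear(d, 1)

    def forward(self, x, hard=False):
        # Gate probability computed from the input itself ("input-dependent").
        p = torch.sigmoid(self.gate(x.mean(dim=1)))          # (B, 1)
        if hard and (p < 0.5).all():
            return x, p                                      # skip the layer entirely
        # Soft version: interpolate between identity and the layer's output.
        return x + p.unsqueeze(-1) * (self.layer(x) - x), p
```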
arXiv Detail & Related papers (2023-03-14T04:47:00Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
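A minimal sketch of the augmented shortcut idea (the extra path here is a plain linear map; the paper's parameterization may differ): a learnable path runs in parallel with the identity shortcut around attention.

```python
import torch
import torch.nn as nn

class AugmentedShortcutBlock(nn.Module):
    """Attention block with an extra learnable path beside the identity shortcut.

    output = x (identity shortcut) + attention(x) + aug(x), where aug is a
    cheap learnable projection intended to restore feature diversity.
    """

    def __init__(self, d=256, nhead=8):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.aug = nn.Linear(d, d)      # the augmented, parameterized shortcut

    def forward(self, x):               # (B, N, d)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out + self.aug(h)
```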
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
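One standard way to make attention linear in the number of frames (the paper's exact space-time mixing rule may differ; this is an assumed local-window variant): each frame's tokens attend only to tokens from a fixed-size window of neighboring frames.

```python
import torch
import torch.nn as nn

class LocalTemporalAttention(nn.Module):
    """Attention restricted to a fixed window of neighboring frames.

    Full space-time attention costs O(T^2) in the frame count T; attending
    over a constant-size window keeps the cost linear in T.
    """

    def __init__(self, d=256, nhead=8, window=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.window = window

    def forward(self, x):                        # (B, T, N, d): frames x tokens
        B, T, N, d = x.shape
        out = torch.empty_like(x)
        for t in range(T):
            lo, hi = max(0, t - self.window), min(T, t + self.window + 1)
            ctx = x[:, lo:hi].reshape(B, -1, d)  # tokens from nearby frames only
            out[:, t], _ = self.attn(x[:, t], ctx, ctx)
        return out
```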
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision into a joint embedding space (dual encoders) is attractive because retrieval scales to large galleries.
The alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over joint embeddings.
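A minimal sketch of the dual-encoder setup referred to above (the encoders are placeholder projections): text and images are embedded independently into one space, so gallery embeddings can be precomputed and each query costs a single matrix product, whereas a cross-attention model must jointly process every query-candidate pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Independent text/image encoders mapped to one joint embedding space."""

    def __init__(self, d_text=300, d_img=2048, d=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d)   # stand-ins for real encoders
        self.img_proj = nn.Linear(d_img, d)

    def score(self, text_feats, img_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        return t @ v.T                           # cosine similarities, (Q, G)

model = DualEncoder()
# Image embeddings are query-independent, so a large gallery is indexed once;
# each text query then costs one matrix product over precomputed vectors.
sims = model.score(torch.randn(4, 300), torch.randn(1000, 2048))
top5 = sims.topk(5, dim=-1).indices             # retrieval results per query
```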
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.