Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
- URL: http://arxiv.org/abs/2207.13820v1
- Date: Wed, 27 Jul 2022 22:54:09 GMT
- Title: Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
- Authors: Junhyeong Cho, Kim Youwang, Tae-Hyun Oh
- Abstract summary: Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction.
Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.
We propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO.
- Score: 17.22112222736234
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer encoder architectures have recently achieved state-of-the-art
results on monocular 3D human mesh reconstruction, but they require a
substantial number of parameters and expensive computations. Due to the large
memory overhead and slow inference speed, it is difficult to deploy such models
for practical use. In this paper, we propose a novel transformer
encoder-decoder architecture for 3D human mesh reconstruction from a single
image, called FastMETRO. We identify that the performance bottleneck in
encoder-based transformers is caused by a token design that introduces
high-complexity interactions among input tokens. We disentangle these
interactions via an encoder-decoder architecture, which allows our model to
require far fewer parameters and a shorter inference time. In addition, we
impose prior knowledge of the human body's morphological relationships via
attention masking and mesh upsampling operations, which leads to faster
convergence with higher
accuracy. Our FastMETRO improves the Pareto-front of accuracy and efficiency,
and clearly outperforms image-based methods on Human3.6M and 3DPW. Furthermore,
we validate its generalizability on FreiHAND.
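As a concrete illustration of the disentangled design described above, here is a minimal sketch (not the authors' released code; dimensions, layer counts, and the topology-mask construction are assumptions): image tokens are refined by an encoder on their own, while learnable joint and coarse-vertex queries interact with them only through decoder cross-attention, with self-attention among mesh tokens optionally masked by body topology.

```python
import torch
import torch.nn as nn

class FastMETROSketch(nn.Module):
    """Minimal sketch of a disentangled encoder-decoder for mesh recovery.

    Image tokens are refined by an encoder on their own; learnable joint and
    coarse-vertex queries touch them only through decoder cross-attention, so
    mesh tokens never inflate the encoder's self-attention.
    """

    def __init__(self, d=256, n_joints=14, n_coarse_verts=431, adjacency=None):
        super().__init__()
        n = n_joints + n_coarse_verts
        self.queries = nn.Parameter(torch.randn(n, d))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=3)
        # Morphological prior: block self-attention between non-adjacent
        # joints/vertices (True = masked). With no topology given, the sketch
        # falls back to full attention.
        if adjacency is not None:                  # (n, n) bool adjacency
            self.register_buffer("attn_mask", ~adjacency)
        else:
            self.attn_mask = None
        self.coord_head = nn.Linear(d, 3)          # one 3D coordinate per token

    def forward(self, img_tokens):                 # (B, N_img, d) backbone features
        memory = self.encoder(img_tokens)          # image tokens only
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        out = self.decoder(q, memory, tgt_mask=self.attn_mask)
        coarse = self.coord_head(out)              # (B, n_joints + 431, 3)
        # A precomputed upsampling matrix would then lift the coarse mesh to
        # the full-resolution mesh (the coarse-to-fine step in the abstract).
        return coarse
```

Because the mesh tokens never enter the encoder, the quadratic interactions the abstract identifies as the bottleneck are removed by construction.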
Related papers
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address the high computational cost of attention in this task.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
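One reading of the "SMPL-based target representation" is that the network regresses compact SMPL pose and shape parameters instead of thousands of per-vertex coordinates; the hypothetical head below illustrates that idea (the decoupled attention operation itself is not reproduced here).

```python
import torch
import torch.nn as nn

class SMPLParamHead(nn.Module):
    """Hypothetical head mapping transformer features to SMPL parameters.

    SMPL describes a body with 72 pose values (24 joints x 3 axis-angle)
    and 10 shape coefficients, so the regression target is far smaller
    than a per-vertex mesh.
    """

    def __init__(self, d=512, n_pose=72, n_shape=10):
        super().__init__()
        self.pose = nn.Linear(d, n_pose)
        self.shape = nn.Linear(d, n_shape)

    def forward(self, feat):                       # (B, d) pooled feature
        return self.pose(feat), self.shape(feat)   # theta, beta

# An SMPL layer (e.g., from the smplx package) would then turn
# (theta, beta) into the 6890-vertex body mesh.
head = SMPLParamHead()
theta, beta = head(torch.randn(2, 512))
print(theta.shape, beta.shape)  # torch.Size([2, 72]) torch.Size([2, 10])
```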
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy that selects the number of encoder layers to run, conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
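A minimal sketch of input-conditioned encoder depth (the gating design and names below are assumptions, not the paper's exact mechanism): a small predictor chooses how many encoder layers to run for each input.

```python
import torch
import torch.nn as nn

class DepthGatedEncoder(nn.Module):
    """Encoder that picks how many of its layers to run, per input."""

    def __init__(self, d=256, max_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
            for _ in range(max_layers))
        # Tiny predictor: pooled input features -> choice of exit depth.
        self.depth_head = nn.Linear(d, max_layers)

    def forward(self, tokens):                     # (1, N, d); batch of 1 keeps
        pooled = tokens.mean(dim=1)                # the per-input depth simple
        depth = int(self.depth_head(pooled).argmax(-1).item()) + 1
        for layer in self.layers[:depth]:          # remaining layers are skipped,
            tokens = layer(tokens)                 # saving encoder compute
        return tokens

enc = DepthGatedEncoder()
out = enc(torch.randn(1, 100, 256))                # depth chosen by the input itself
```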
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation [0.13654846342364302]
We present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features.
SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features.
We benchmark SegFormer3D against the current SOTA models on three widely used datasets.
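A minimal sketch of an all-MLP decoder over multiscale volumetric features, in the SegFormer spirit (channel sizes and names are illustrative assumptions): each scale is linearly projected, resampled to a common resolution, concatenated, and fused by another linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder3D(nn.Module):
    """Fuses multiscale 3D features using only per-voxel linear layers."""

    def __init__(self, in_dims=(32, 64, 160, 256), d=256, n_classes=4):
        super().__init__()
        # 1x1x1 convolutions act as per-voxel MLPs (the "all-MLP" part).
        self.proj = nn.ModuleList(nn.Conv3d(c, d, 1) for c in in_dims)
        self.fuse = nn.Conv3d(d * len(in_dims), d, 1)
        self.classify = nn.Conv3d(d, n_classes, 1)

    def forward(self, feats):      # list of (B, C_i, D_i, H_i, W_i), finest first
        size = feats[0].shape[2:]  # resample every scale to the finest grid
        ups = [F.interpolate(p(f), size=size, mode="trilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))
```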
arXiv Detail & Related papers (2024-04-15T22:12:05Z)
- Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers [37.14235383028582]
We introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference.
Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation.
arXiv Detail & Related papers (2023-12-14T17:18:34Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging [7.601695814245209]
We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured by a single measurement.
By combining optimization algorithms with neural networks, deep unfolding networks (DUNs) have achieved strong results on inverse problems.
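A minimal sketch of a deep unfolding iteration for a linear inverse problem y = Ax (the learned prior here is a placeholder; the paper's convolution-transformer mixture prior and uncertainty estimation are not reproduced): each unrolled stage alternates a data-fidelity gradient step with a learned prior step.

```python
import torch
import torch.nn as nn

class UnfoldingNet(nn.Module):
    """K unrolled stages: gradient step on 0.5*||Ax - y||^2, then a learned prior."""

    def __init__(self, n=64, k_stages=5):
        super().__init__()
        self.steps = nn.Parameter(torch.full((k_stages,), 0.1))  # learned step sizes
        self.priors = nn.ModuleList(            # stand-ins for the paper's
            nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n))
            for _ in range(k_stages))           # convolution-transformer prior

    def forward(self, y, A):                    # y: (B, m) measurements, A: (m, n)
        x = y @ A                               # crude initialization, ~ A^T y
        for step, prior in zip(self.steps, self.priors):
            grad = (x @ A.T - y) @ A            # gradient of the data-fidelity term
            x = prior(x - step * grad)          # optimization step, then prior step
        return x
```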
arXiv Detail & Related papers (2023-06-20T06:25:48Z)
- TriPlaneNet: An Encoder for EG3D Inversion [1.9567015559455132]
NeRF-based GANs have introduced a number of approaches for high-resolution and high-fidelity generative modeling of human heads.
Despite the success of universal optimization-based methods for 2D GAN inversion, those applied to 3D GANs may fail to extrapolate the result to novel views.
We introduce a fast technique that bridges the gap between the two approaches by directly utilizing the tri-plane representation presented for the EG3D generative model.
arXiv Detail & Related papers (2023-03-23T17:56:20Z)
- I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present an analysis of the gate probabilities and their input dependency, which helps us better understand deep encoders.
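A minimal sketch of an input-dependent layer gate in the spirit of this summary (the gate parameterization is an assumption): a per-layer probability, computed from the input itself, scales the layer's contribution and allows skipping it at inference.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """Transformer layer wrapped with an input-dependent gate."""

    def __init__(self, d=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.gate = nn.Linear(d, 1)

    def forward(self, x, hard=False):
        # Gate probability computed from the input itself ("input-dependent").
        p = torch.sigmoid(self.gate(x.mean(dim=1)))          # (B, 1)
        if hard and (p < 0.5).all():
            return x, p                                      # skip the layer entirely
        # Soft version: interpolate between identity and the layer's output.
        return x + p.unsqueeze(-1) * (self.layer(x) - x), p
```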
arXiv Detail & Related papers (2023-03-14T04:47:00Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
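A minimal sketch of the augmented shortcut idea (the extra path here is a plain linear map; the paper's parameterization may differ): a learnable path runs in parallel with the identity shortcut around attention.

```python
import torch
import torch.nn as nn

class AugmentedShortcutBlock(nn.Module):
    """Attention block with an extra learnable path beside the identity shortcut.

    output = x (identity shortcut) + attention(x) + aug(x), where aug is a
    cheap learnable projection intended to restore feature diversity.
    """

    def __init__(self, d=256, nhead=8):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.aug = nn.Linear(d, d)      # the augmented, parameterized shortcut

    def forward(self, x):               # (B, N, d)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out + self.aug(h)
```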
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
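One standard way to make attention linear in the number of frames (the paper's exact space-time mixing rule may differ; this is an assumed local-window variant): each frame's tokens attend only to tokens from a fixed-size window of neighboring frames.

```python
import torch
import torch.nn as nn

class LocalTemporalAttention(nn.Module):
    """Attention restricted to a fixed window of neighboring frames.

    Full space-time attention costs O(T^2) in the frame count T; attending
    over a constant-size window keeps the cost linear in T.
    """

    def __init__(self, d=256, nhead=8, window=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.window = window

    def forward(self, x):                        # (B, T, N, d): frames x tokens
        B, T, N, d = x.shape
        out = torch.empty_like(x)
        for t in range(T):
            lo, hi = max(0, t - self.window), min(T, t + self.window + 1)
            ctx = x[:, lo:hi].reshape(B, -1, d)  # tokens from nearby frames only
            out[:, t], _ = self.attn(x[:, t], ctx, ctx)
        return out
```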
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision into a joint embedding space (dual encoders) is attractive because retrieval scales to large galleries.
The alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over joint embeddings.
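A minimal sketch of the dual-encoder setup referred to above (the encoders are placeholder projections): text and images are embedded independently into one space, so gallery embeddings can be precomputed and each query costs a single matrix product, whereas a cross-attention model must jointly process every query-candidate pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Independent text/image encoders mapped to one joint embedding space."""

    def __init__(self, d_text=300, d_img=2048, d=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d)   # stand-ins for real encoders
        self.img_proj = nn.Linear(d_img, d)

    def score(self, text_feats, img_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        return t @ v.T                           # cosine similarities, (Q, G)

model = DualEncoder()
# Image embeddings are query-independent, so a large gallery is indexed once;
# each text query then costs one matrix product over precomputed vectors.
sims = model.score(torch.randn(4, 300), torch.randn(1000, 2048))
top5 = sims.topk(5, dim=-1).indices             # retrieval results per query
```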
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.