Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
- URL: http://arxiv.org/abs/2309.01943v1
- Date: Tue, 5 Sep 2023 04:18:03 GMT
- Title: Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
- Authors: JoonKyu Park, Daniel Sungho Jung, Gyeongsik Moon, Kyoung Mu Lee
- Abstract summary: We present EANet, an extract-and-adaptation network, built around EABlock, the main component of our network.
Our two novel tokens are derived from a combination of the two separated hand features; hence, they are much more robust to the distant token problem.
The proposed EANet achieves state-of-the-art performance on 3D interacting hand benchmarks.
- Score: 64.37035857740781
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding how two hands interact with each other is a key component of
accurate 3D interacting hand mesh recovery. However, recent Transformer-based
methods struggle to learn the interaction between two hands because they directly
use the two hand features as input tokens, which results in the distant token
problem: the input tokens lie in heterogeneous feature spaces, so the Transformer
fails to capture the correlation between them. Previous Transformer-based methods
suffer from this problem especially when the poses of the two hands differ greatly,
since they project backbone features into separate left- and right-hand-dedicated
features. We present EANet, an extract-and-adaptation network, built around EABlock,
the main component of our network. Rather than directly using the two hand features
as input tokens, EABlock takes two complementary types of novel tokens, SimToken and
JoinToken, as input. Both tokens are derived from a combination of the two separated
hand features and are therefore much more robust to the distant token problem. Using
the two types of tokens, EABlock effectively extracts an interaction feature and
adapts it to each hand. The proposed EANet achieves state-of-the-art performance on
3D interacting hand benchmarks. The code is available at https://github.com/jkpark0825/EANet.
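The abstract describes the mechanism only at a high level, so the following PyTorch-style sketch is an illustration rather than the authors' implementation: it assumes SimToken is built by summing the separated left/right hand features and JoinToken by concatenating and projecting them, then runs a small Transformer step to extract an interaction feature and cross-attends it back onto each hand. All module and variable names here are hypothetical; the actual code is in the repository linked above.

```python
# Illustrative sketch only -- the real EABlock is in https://github.com/jkpark0825/EANet.
# Assumption: SimToken is formed by summing the two hand features and JoinToken by
# concatenating-then-projecting them, so both tokens live in a shared feature space.
import torch
import torch.nn as nn


class EABlockSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.join_proj = nn.Linear(2 * dim, dim)           # fuse concatenated L/R features
        self.extract = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )                                                   # extract interaction feature
        self.adapt_left = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.adapt_right = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_left: torch.Tensor, f_right: torch.Tensor):
        # f_left, f_right: (B, N, dim) token features of each hand from the backbone
        sim_token = f_left + f_right                                # combined, not hand-specific
        join_token = self.join_proj(torch.cat([f_left, f_right], dim=-1))
        interaction = self.extract(torch.cat([sim_token, join_token], dim=1))
        # adapt the shared interaction feature back to each hand
        left_out, _ = self.adapt_left(f_left, interaction, interaction)
        right_out, _ = self.adapt_right(f_right, interaction, interaction)
        return f_left + left_out, f_right + right_out


if __name__ == "__main__":
    block = EABlockSketch()
    l, r = torch.randn(2, 49, 256), torch.randn(2, 49, 256)
    out_l, out_r = block(l, r)
    print(out_l.shape, out_r.shape)  # each: torch.Size([2, 49, 256])
```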
Related papers
- OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer [35.983309206845036]
We introduce OmniHands, a universal approach to recovering interactive hand meshes and their relative movement from monocular or multi-view inputs.
We develop a universal architecture with novel tokenization and contextual feature fusion strategies.
The efficacy of our approach is validated on several benchmark datasets.
arXiv Detail & Related papers (2024-05-30T17:59:02Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
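Based only on the summary above, a rough sketch of the meta-token idea might look as follows: a small set of learnable meta tokens cross-attends to the dense patch tokens to summarize them, and the dense tokens then cross-attend back to the meta tokens. This is a speculative reading, not the LeMeViT reference code, and every name and shape below is an assumption.

```python
# Speculative sketch of dual cross-attention between a small set of learnable meta
# tokens and dense visual tokens (assumptions only; not the LeMeViT reference code).
import torch
import torch.nn as nn


class DualCrossAttentionSketch(nn.Module):
    def __init__(self, dim: int = 192, num_heads: int = 4, num_meta: int = 16):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)  # learnable meta tokens
        self.meta_from_dense = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dense_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, dense: torch.Tensor):
        # dense: (B, N, dim) patch tokens; meta tokens summarize them, then feed back
        meta = self.meta.expand(dense.size(0), -1, -1)
        meta, _ = self.meta_from_dense(meta, dense, dense)    # meta <- dense (summarize)
        dense, _ = self.dense_from_meta(dense, meta, meta)    # dense <- meta (broadcast back)
        return dense, meta
```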
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR [58.136778669618096]
Unified speech-text models have achieved remarkable performance on various speech tasks.
We propose to model speech tokens in an autoregressive way, similar to text.
We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance.
arXiv Detail & Related papers (2023-11-08T08:45:14Z)
- Explaining Interactions Between Text Spans [50.70253702800355]
Reasoning over spans of tokens from different parts of the input is essential for natural language understanding.
We introduce SpanEx, a dataset of human span interaction explanations for two NLU tasks: NLI and FC.
We then investigate the decision-making processes of multiple fine-tuned large language models in terms of the employed connections between spans.
arXiv Detail & Related papers (2023-10-20T13:52:37Z)
- Dynamic Token-Pass Transformers for Semantic Segmentation [22.673910995773262]
We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria.
Our method reduces FLOPs by roughly 40%$\sim$60%, and the drop in mIoU stays within 0.8% for various segmentation transformers.
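The summary only states that easy tokens stop early; the toy sketch below illustrates one way such a token-pass layer could work, using a learned per-token halting score and a fixed threshold (both assumptions). It is not the DoViT implementation.

```python
# Toy sketch of dynamic token passing (assumptions only; not the DoViT implementation):
# tokens whose halting score exceeds a threshold are frozen and skipped by later layers.
import torch
import torch.nn as nn


class TokenPassLayerSketch(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6, threshold: float = 0.9):
        super().__init__()
        self.attn_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.halt_head = nn.Linear(dim, 1)   # per-token "easy" score
        self.threshold = threshold

    def forward(self, tokens: torch.Tensor, active: torch.Tensor):
        # tokens: (B, N, dim); active: (B, N) bool mask of tokens still being refined
        out = tokens.clone()
        for b in range(tokens.size(0)):                      # per-sample ragged selection
            idx = active[b].nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            refined = self.attn_layer(tokens[b:b + 1, idx])  # self-attention over hard tokens only
            out[b, idx] = refined[0]
            halt = torch.sigmoid(self.halt_head(refined[0])).squeeze(-1)
            active[b, idx] = halt < self.threshold           # freeze tokens that became "easy"
        return out, active
```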
arXiv Detail & Related papers (2023-08-03T06:14:24Z)
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
- Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
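As a loose illustration of the first idea only, the sketch below average-pools each token with its spatial neighbors before attention so that local context participates in the attention computation; the paper's actual TAP module is learned, so treat this purely as an assumed simplification, with all names and shapes chosen for the example.

```python
# Rough sketch of pooling a token with its local neighborhood before attention
# (an assumed simplification; not the paper's TAP module).
import torch
import torch.nn as nn


class LocalPoolBeforeAttentionSketch(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6, kernel: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel, stride=1, padding=kernel // 2)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        # tokens: (B, h*w, dim) patch tokens laid out on an h x w grid
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        pooled = self.pool(grid).reshape(b, d, n).transpose(1, 2)  # neighborhood-aware tokens
        out, _ = self.attn(pooled, pooled, tokens)                 # queries/keys carry local context
        return out
```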
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
- Compound Tokens: Channel Fusion for Vision-Language Representation Learning [36.19486792701684]
We present an effective method for fusing visual-and-language representations for question answering tasks.
By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods.
We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting.
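One plausible reading of channel-wise fusion, sketched below under assumptions: each text token first retrieves a matched visual feature via cross-attention, and the two are then concatenated along the channel dimension rather than appended along the sequence. Names and dimensions are illustrative, not taken from the paper.

```python
# Minimal sketch of channel-wise token fusion (an interpretation of "compound tokens":
# pair vision and text tokens and concatenate along channels instead of along the sequence).
import torch
import torch.nn as nn


class ChannelFusionSketch(nn.Module):
    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, out_dim: int = 768, heads: int = 8):
        super().__init__()
        # cross-attention pulls, for every text token, a matched visual summary (width txt_dim)
        self.align = nn.MultiheadAttention(txt_dim, heads, batch_first=True,
                                           kdim=vis_dim, vdim=vis_dim)
        self.proj = nn.Linear(2 * txt_dim, out_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # vis_tokens: (B, Nv, vis_dim); txt_tokens: (B, Nt, txt_dim)
        matched_vis, _ = self.align(txt_tokens, vis_tokens, vis_tokens)   # (B, Nt, txt_dim)
        compound = torch.cat([matched_vis, txt_tokens], dim=-1)           # channel-wise fusion
        return self.proj(compound)                                        # (B, Nt, out_dim)
```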
arXiv Detail & Related papers (2022-12-02T21:09:52Z)
- SWAT: Spatial Structure Within and Among Tokens [53.525469741515884]
We argue that models can have significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing.
arXiv Detail & Related papers (2021-11-26T18:59:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.