Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
- URL: http://arxiv.org/abs/2309.01943v1
- Date: Tue, 5 Sep 2023 04:18:03 GMT
- Title: Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
- Authors: JoonKyu Park, Daniel Sungho Jung, Gyeongsik Moon, Kyoung Mu Lee
- Abstract summary: We present EANet, an extract-and-adaptation network, built around EABlock, the main component of our network.
Our two novel tokens are derived from a combination of the two separated hand features; hence, they are much more robust to the distant token problem.
The proposed EANet achieves state-of-the-art performance on 3D interacting hand benchmarks.
- Score: 64.37035857740781
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding how two hands interact with each other is a key component of
accurate 3D interacting hand mesh recovery. However, recent Transformer-based
methods struggle to learn the interaction between two hands because they directly
use the two hand features as input tokens, which results in the distant token
problem: the input tokens lie in heterogeneous feature spaces, so the Transformer
fails to capture the correlation between them. Previous Transformer-based methods
suffer from this problem especially when the poses of the two hands differ greatly,
since they project backbone features into separate left- and right-hand-dedicated
features. We present EANet, an extract-and-adaptation network, built around EABlock,
the main component of our network. Rather than directly using the two hand features
as input tokens, EABlock takes two complementary types of novel tokens, SimToken and
JoinToken, as input. Both tokens are derived from a combination of the two separated
hand features and are therefore much more robust to the distant token problem. Using
the two types of tokens, EABlock effectively extracts an interaction feature and
adapts it to each hand. The proposed EANet achieves state-of-the-art performance on
3D interacting hand benchmarks. The code is available at https://github.com/jkpark0825/EANet.
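The abstract describes the mechanism only at a high level, so the following PyTorch-style sketch is an illustration rather than the authors' implementation: it assumes SimToken is built by summing the separated left/right hand features and JoinToken by concatenating and projecting them, then runs a small Transformer step to extract an interaction feature and cross-attends it back onto each hand. All module and variable names here are hypothetical; the actual code is in the repository linked above.

```python
# Illustrative sketch only -- the real EABlock is in https://github.com/jkpark0825/EANet.
# Assumption: SimToken is formed by summing the two hand features and JoinToken by
# concatenating-then-projecting them, so both tokens live in a shared feature space.
import torch
import torch.nn as nn


class EABlockSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.join_proj = nn.Linear(2 * dim, dim)           # fuse concatenated L/R features
        self.extract = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )                                                   # extract interaction feature
        self.adapt_left = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.adapt_right = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_left: torch.Tensor, f_right: torch.Tensor):
        # f_left, f_right: (B, N, dim) token features of each hand from the backbone
        sim_token = f_left + f_right                                # combined, not hand-specific
        join_token = self.join_proj(torch.cat([f_left, f_right], dim=-1))
        interaction = self.extract(torch.cat([sim_token, join_token], dim=1))
        # adapt the shared interaction feature back to each hand
        left_out, _ = self.adapt_left(f_left, interaction, interaction)
        right_out, _ = self.adapt_right(f_right, interaction, interaction)
        return f_left + left_out, f_right + right_out


if __name__ == "__main__":
    block = EABlockSketch()
    l, r = torch.randn(2, 49, 256), torch.randn(2, 49, 256)
    out_l, out_r = block(l, r)
    print(out_l.shape, out_r.shape)  # each: torch.Size([2, 49, 256])
```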
Related papers
- OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer [35.983309206845036]
We introduce OmniHands, a universal approach to recovering interactive hand meshes and their relative movement from monocular or multi-view inputs.
We develop a universal architecture with novel tokenization and contextual feature fusion strategies.
The efficacy of our approach is validated on several benchmark datasets.
arXiv Detail & Related papers (2024-05-30T17:59:02Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
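Based only on the summary above, a rough sketch of the meta-token idea might look as follows: a small set of learnable meta tokens cross-attends to the dense patch tokens to summarize them, and the dense tokens then cross-attend back to the meta tokens. This is a speculative reading, not the LeMeViT reference code, and every name and shape below is an assumption.

```python
# Speculative sketch of dual cross-attention between a small set of learnable meta
# tokens and dense visual tokens (assumptions only; not the LeMeViT reference code).
import torch
import torch.nn as nn


class DualCrossAttentionSketch(nn.Module):
    def __init__(self, dim: int = 192, num_heads: int = 4, num_meta: int = 16):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)  # learnable meta tokens
        self.meta_from_dense = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dense_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, dense: torch.Tensor):
        # dense: (B, N, dim) patch tokens; meta tokens summarize them, then feed back
        meta = self.meta.expand(dense.size(0), -1, -1)
        meta, _ = self.meta_from_dense(meta, dense, dense)    # meta <- dense (summarize)
        dense, _ = self.dense_from_meta(dense, meta, meta)    # dense <- meta (broadcast back)
        return dense, meta
```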
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR [58.136778669618096]
Unified speech-text models have achieved remarkable performance on various speech tasks.
We propose to model speech tokens in an autoregressive way, similar to text.
We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance.
arXiv Detail & Related papers (2023-11-08T08:45:14Z)
- Explaining Interactions Between Text Spans [50.70253702800355]
Reasoning over spans of tokens from different parts of the input is essential for natural language understanding.
We introduce SpanEx, a dataset of human span interaction explanations for two NLU tasks: NLI and FC.
We then investigate the decision-making processes of multiple fine-tuned large language models in terms of the employed connections between spans.
arXiv Detail & Related papers (2023-10-20T13:52:37Z)
- Dynamic Token-Pass Transformers for Semantic Segmentation [22.673910995773262]
We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria.
Our method reduces FLOPs by roughly 40%$\sim$60%, and the drop in mIoU stays within 0.8% for various segmentation transformers.
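The summary only states that easy tokens stop early; the toy sketch below illustrates one way such a token-pass layer could work, using a learned per-token halting score and a fixed threshold (both assumptions). It is not the DoViT implementation.

```python
# Toy sketch of dynamic token passing (assumptions only; not the DoViT implementation):
# tokens whose halting score exceeds a threshold are frozen and skipped by later layers.
import torch
import torch.nn as nn


class TokenPassLayerSketch(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6, threshold: float = 0.9):
        super().__init__()
        self.attn_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.halt_head = nn.Linear(dim, 1)   # per-token "easy" score
        self.threshold = threshold

    def forward(self, tokens: torch.Tensor, active: torch.Tensor):
        # tokens: (B, N, dim); active: (B, N) bool mask of tokens still being refined
        out = tokens.clone()
        for b in range(tokens.size(0)):                      # per-sample ragged selection
            idx = active[b].nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            refined = self.attn_layer(tokens[b:b + 1, idx])  # self-attention over hard tokens only
            out[b, idx] = refined[0]
            halt = torch.sigmoid(self.halt_head(refined[0])).squeeze(-1)
            active[b, idx] = halt < self.threshold           # freeze tokens that became "easy"
        return out, active
```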
arXiv Detail & Related papers (2023-08-03T06:14:24Z)
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
- Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
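As a loose illustration of the first idea only, the sketch below average-pools each token with its spatial neighbors before attention so that local context participates in the attention computation; the paper's actual TAP module is learned, so treat this purely as an assumed simplification, with all names and shapes chosen for the example.

```python
# Rough sketch of pooling a token with its local neighborhood before attention
# (an assumed simplification; not the paper's TAP module).
import torch
import torch.nn as nn


class LocalPoolBeforeAttentionSketch(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6, kernel: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel, stride=1, padding=kernel // 2)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        # tokens: (B, h*w, dim) patch tokens laid out on an h x w grid
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        pooled = self.pool(grid).reshape(b, d, n).transpose(1, 2)  # neighborhood-aware tokens
        out, _ = self.attn(pooled, pooled, tokens)                 # queries/keys carry local context
        return out
```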
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
- Compound Tokens: Channel Fusion for Vision-Language Representation Learning [36.19486792701684]
We present an effective method for fusing visual-and-language representations for question answering tasks.
By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods.
We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting.
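One plausible reading of channel-wise fusion, sketched below under assumptions: each text token first retrieves a matched visual feature via cross-attention, and the two are then concatenated along the channel dimension rather than appended along the sequence. Names and dimensions are illustrative, not taken from the paper.

```python
# Minimal sketch of channel-wise token fusion (an interpretation of "compound tokens":
# pair vision and text tokens and concatenate along channels instead of along the sequence).
import torch
import torch.nn as nn


class ChannelFusionSketch(nn.Module):
    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, out_dim: int = 768, heads: int = 8):
        super().__init__()
        # cross-attention pulls, for every text token, a matched visual summary (width txt_dim)
        self.align = nn.MultiheadAttention(txt_dim, heads, batch_first=True,
                                           kdim=vis_dim, vdim=vis_dim)
        self.proj = nn.Linear(2 * txt_dim, out_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # vis_tokens: (B, Nv, vis_dim); txt_tokens: (B, Nt, txt_dim)
        matched_vis, _ = self.align(txt_tokens, vis_tokens, vis_tokens)   # (B, Nt, txt_dim)
        compound = torch.cat([matched_vis, txt_tokens], dim=-1)           # channel-wise fusion
        return self.proj(compound)                                        # (B, Nt, out_dim)
```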
arXiv Detail & Related papers (2022-12-02T21:09:52Z)
- SWAT: Spatial Structure Within and Among Tokens [53.525469741515884]
We argue that models can have significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing.
arXiv Detail & Related papers (2021-11-26T18:59:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.