ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition
- URL: http://arxiv.org/abs/2411.07118v2
- Date: Sun, 17 Nov 2024 18:58:41 GMT
- Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
- Abstract summary: We explore and devise a novel ConvMixFormer architecture for dynamic hand gestures.
The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets.
Our model has achieved state-of-the-art results on single and multimodal inputs.
- Score: 5.311735227179715
- Abstract: Transformer models have demonstrated remarkable success in many domains, such as natural language processing (NLP) and computer vision, and with the growing interest in transformer-based architectures they are now being applied to gesture recognition. We therefore explore and devise a novel ConvMixFormer architecture for dynamic hand gestures. Self-attention in transformers scales quadratically with sequence length, which makes these models computationally complex and heavy. We address this drawback by designing a resource-efficient model that replaces the self-attention in the transformer with a simple convolutional layer-based token mixer. The convolution-based mixer requires fewer parameters and less computation than quadratic self-attention, and it helps the model capture local spatial features that self-attention struggles to capture because it processes the input as a flat sequence of tokens. Further, an efficient gate mechanism is employed in place of the conventional feed-forward network to help the model control the flow of features through the different stages of the proposed model. This design uses nearly half as many learnable parameters as the vanilla transformer, which enables fast and efficient training. The proposed method is evaluated on the NVidia Dynamic Hand Gesture and Briareo datasets, and our model achieves state-of-the-art results on single and multimodal inputs. We also show the parameter efficiency of the proposed ConvMixFormer model compared to other methods. The source code is available at https://github.com/mallikagarg/ConvMixFormer.
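To make the architecture concrete, here is a minimal PyTorch sketch of one block as the abstract describes it: a convolutional token mixer in place of self-attention, and a gate mechanism in place of the feed-forward network. The specific layer choices (a depthwise 1-D convolution as the mixer, a GLU-style sigmoid gate, kernel size, and dimensions) are illustrative assumptions, not the authors' exact configuration; see the linked repository for the reference implementation.

```python
import torch
import torch.nn as nn

class ConvMixFormerBlock(nn.Module):
    """One encoder block: a convolutional token mixer replacing
    self-attention, and a gated projection replacing the usual FFN.
    Layer choices here (depthwise 1-D conv, sigmoid gate) are
    assumptions made for illustration."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixer: depthwise conv along the sequence axis. Cost is
        # O(n * k * d) in sequence length n, vs O(n^2 * d) for attention.
        self.mixer = nn.Conv1d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        # Gate mechanism standing in for the feed-forward network: one
        # branch modulates the other, controlling feature flow with far
        # fewer parameters than the usual 4x-expanded FFN.
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = self.norm1(x).transpose(1, 2)      # (batch, dim, seq_len)
        x = x + self.mixer(h).transpose(1, 2)  # residual conv mixing
        h = self.norm2(x)
        x = x + self.proj(self.value(h) * torch.sigmoid(self.gate(h)))
        return x

# Example: a batch of 8 gesture clips, 16 frame tokens, 256-dim features.
block = ConvMixFormerBlock(dim=256)
out = block(torch.randn(8, 16, 256))
print(out.shape)  # torch.Size([8, 16, 256])
```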
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to a non-shared model with the same dimensionality (a short sketch of this idea follows this entry).
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
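The parameter savings from layer-sharing are easy to demonstrate; below is a minimal sketch of the weight-tying idea alone (not MoEUT's mixture-of-experts machinery), with all dimensions chosen arbitrarily for illustration.

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# One Transformer layer reused at every depth (Universal-Transformer style)
# versus twelve independent layers of the same dimensionality.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
shared = nn.ModuleList([layer] * 12)   # 12 references, 1 set of weights
unshared = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
     for _ in range(12)])

print(param_count(shared))    # parameters of a single layer
print(param_count(unshared))  # roughly 12x as many
```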
- GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition [5.311735227179715]
Transformer models have achieved state-of-the-art results in many applications such as NLP and classification.
We propose a novel GestFormer architecture for dynamic hand gesture recognition.
arXiv Detail & Related papers (2024-05-18T05:16:32Z)
- Shifted Chunk Transformer for Spatio-Temporal Representational Learning [24.361059477031162]
We construct a shifted chunk Transformer with pure self-attention blocks.
This Transformer can learn hierarchical spatio-temporal features from a tiny patch to a global video clip.
It outperforms state-of-the-art approaches on Kinetics, Kinetics-600, UCF101, and HMDB51.
arXiv Detail & Related papers (2021-08-26T04:34:33Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches (a short patch-embedding sketch follows this entry).
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
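A minimal sketch of the patch-based input pipeline the summary mentions; implementing patch embedding as a strided convolution is a common trick, and the image size, patch size, and embedding width here are assumptions rather than VST's actual settings.

```python
import torch
import torch.nn as nn

# Split a 224x224 RGB image into 16x16 patches and embed each patch as a
# token; a stride-16 convolution performs both steps at once.
patch_embed = nn.Conv2d(in_channels=3, out_channels=384,
                        kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 384)
# Self-attention over these 196 patch tokens is what lets the transformer
# propagate global context across the whole image.
```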
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces the parameter count by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is a data augmentation technique that linearly interpolates input examples and their corresponding labels (a short sketch follows this entry).
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
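Since mixup is defined by a single linear interpolation, a short sketch makes it concrete. Sampling the mixing coefficient from a Beta distribution is the standard mixup recipe; the batch shapes and the choice of mixing sentence embeddings rather than raw text are illustrative assumptions.

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Linearly interpolate a batch with a shuffled copy of itself.
    x: inputs (here, sentence embeddings); y: one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix

# Example: 4 sentence embeddings of width 768 with 3-class one-hot labels.
x = torch.randn(4, 768)
y = torch.eye(3)[torch.tensor([0, 2, 1, 0])]
x_mix, y_mix = mixup(x, y)
```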