Related papers: MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition

URL: http://arxiv.org/abs/2409.03890v1
Date: Thu, 5 Sep 2024 19:55:38 GMT
Title: MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition
Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Abstract summary: We introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters.
Score: 5.311735227179715
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. Multiscale features help capture the variable size, pose, and shape of the hand, which is a challenge in hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures, which enhances the model's recognition ability. This multiscale hierarchy is obtained by extracting different dimensions of attention in different transformer stages, with initial stages modeling high-resolution features and later stages modeling low-resolution features. Our approach also leverages multimodal data, utilizing depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at https://github.com/mallikagarg/MVTN.
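
The stage-wise hierarchy described in the abstract (high resolution early, low resolution late) can be illustrated with a small sketch. This is not the authors' implementation: the function name `multiscale_stage_shapes`, the halving of spatial resolution, the doubling of channel width, and the 56x56x96 starting grid are all assumptions chosen for illustration, following the convention common to multiscale vision transformers.

```python
def multiscale_stage_shapes(height, width, channels, num_stages, pool=2, widen=2):
    """Return the (token_rows, token_cols, channels) shape seen by each stage.

    Hypothetical helper: each stage divides the spatial token grid by `pool`
    and multiplies the channel width by `widen`. These factors are
    illustrative assumptions, not values taken from the MVTN paper.
    """
    shapes = []
    h, w, c = height, width, channels
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h = max(1, h // pool)  # coarser spatial grid in later stages
        w = max(1, w // pool)
        c = c * widen          # wider channels in later stages
    return shapes


def attention_pairs(shape):
    """Number of token pairs that full self-attention compares at one stage."""
    h, w, _ = shape
    tokens = h * w
    return tokens * tokens


shapes = multiscale_stage_shapes(56, 56, 96, num_stages=4)
for index, shape in enumerate(shapes, start=1):
    h, w, c = shape
    print(f"stage {index}: {h}x{w} tokens, {c} channels, "
          f"{attention_pairs(shape)} attention pairs")
```

Under these assumed factors, each later stage attends over a quarter as many tokens as the previous one, which is one way a multiscale hierarchy can reduce attention cost in its later stages.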

Related papers

MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion [2.7745600113170994]
We introduce the MultiSensor-Home dataset, a novel benchmark for comprehensive action recognition in home environments. We also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method.
arXiv Detail & Related papers (2025-04-03T05:23:08Z)
Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition [5.311735227179715]
Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. Experiments demonstrate the superior performance of the proposed MsMHA-VTN with an overall accuracy of 88.22% and 99.10% on NVGesture and Briareo datasets.
arXiv Detail & Related papers (2025-01-01T19:26:32Z)
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition [5.311735227179715]
Transformer models have achieved state-of-the-art results in many applications like NLP, classification, etc. We propose a novel GestFormer architecture for dynamic hand gesture recognition.
arXiv Detail & Related papers (2024-05-18T05:16:32Z)
MMViT: Multiscale Multiview Vision Transformers [36.93551299085767]
We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models. Our model encodes different views of the input signal and builds several channel-resolution feature stages to process the multiple views of the input at different resolutions in parallel. We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-04-28T21:51:41Z)
MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. MVTN can be trained end-to-end with any multi-view network for 3D shape recognition. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations. PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context. We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI). First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases. Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages. Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
arXiv Detail & Related papers (2021-12-08T10:50:03Z)
Parallel mesh reconstruction streams for pose estimation of interacting hands [2.0305676256390934]
We present a new multi-stream 3D mesh reconstruction network (MSMR-Net) for hand pose estimation from a single RGB image. Our model consists of an image encoder followed by a mesh-convolution decoder composed of connected graph convolution layers.
arXiv Detail & Related papers (2021-04-25T10:14:15Z)
Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z)
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [74.19291916812921]
Forged images generated by Deepfake techniques pose a serious threat to the trustworthiness of digital information. In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection. We introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods.
arXiv Detail & Related papers (2021-04-20T05:43:44Z)
Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt the graph propagation to capture the observed spatial contexts. We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively. Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively. Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.