Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression
- URL: http://arxiv.org/abs/2409.03438v1
- Date: Thu, 5 Sep 2024 11:39:43 GMT
- Title: Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression
- Authors: Ibtissam Saadi, Douglas W. Cunningham, Taleb-ahmed Abdelmalik, Abdenour Hadid, Yassin El Hillali
- Abstract summary: Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications.
We introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which elegantly combines computational efficiency and accuracy.
- Score: 4.034679618136641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications. In this work, we introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which elegantly combines computational efficiency and accuracy. This is achieved by harnessing the strengths of two lightweight and efficient models using convolutional neural network (CNN) and vision transformers (ViT). We efficiently fuse the extracted features to enhance the performance of the model in accurately recognizing the facial expressions of the driver. Our experimental results on two benchmarking and public datasets, KMU-FED and KDEF, highlight the validity of our proposed method for real-time application with superior performance when compared to state-of-the-art methods.
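The abstract describes the dual architecture only at a high level. The central idea of fusing features from a lightweight CNN and a ViT can be sketched as below; the feature dimensions, the concatenation-plus-linear-head fusion, and the seven expression classes are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(cnn_feat, vit_feat, w, b):
    """Concatenate CNN and ViT feature vectors, then apply a linear classifier."""
    fused = np.concatenate([cnn_feat, vit_feat], axis=-1)  # (d_cnn + d_vit,)
    logits = fused @ w + b                                  # (n_classes,)
    return logits

# Hypothetical sizes: 576-d CNN features, 384-d ViT features, 7 expression classes.
d_cnn, d_vit, n_classes = 576, 384, 7
w = rng.normal(size=(d_cnn + d_vit, n_classes)) * 0.01
b = np.zeros(n_classes)

cnn_feat = rng.normal(size=d_cnn)   # stand-in for lightweight-CNN features
vit_feat = rng.normal(size=d_vit)   # stand-in for ViT [CLS] features
logits = fuse_and_classify(cnn_feat, vit_feat, w, b)
pred = int(np.argmax(logits))       # predicted expression class index
```

In practice the two backbones would be pretrained (transfer learning) and the fusion head trained on the target dataset; concatenation is only the simplest plausible fusion.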
Related papers
- big.LITTLE Vision Transformer for Efficient Visual Recognition [34.015778625984055]
big.LITTLE Vision Transformer is an innovative architecture aimed at achieving efficient visual recognition.
The system is composed of two distinct blocks: the big performance block and the LITTLE efficiency block.
When processing an image, our system determines the importance of each token and allocates them accordingly.
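The token-allocation idea summarized above can be sketched as a simple top-k split: score each token, route the highest-scoring tokens through the heavy block and the rest through the cheap one. The importance scores, the 2x/1x stand-in blocks, and all sizes here are hypothetical, not the paper's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def route_tokens(tokens, scores, k):
    """Send the k highest-scoring tokens to the 'big' block, the rest to 'LITTLE'."""
    order = np.argsort(-scores)            # token indices, most important first
    big_idx, little_idx = order[:k], order[k:]
    big_out = tokens[big_idx] * 2.0        # stand-in for the big performance block
    little_out = tokens[little_idx] * 1.0  # stand-in for the LITTLE efficiency block
    out = np.empty_like(tokens)
    out[big_idx] = big_out
    out[little_idx] = little_out
    return out, big_idx

tokens = rng.normal(size=(16, 8))   # 16 tokens, 8-d embeddings
scores = rng.normal(size=16)        # per-token importance (e.g. from a learned scorer)
out, big_idx = route_tokens(tokens, scores, k=4)
```

The efficiency gain comes from the LITTLE path being much cheaper than the big one, so only the k most important tokens pay full cost.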
arXiv Detail & Related papers (2024-10-14T08:21:00Z)
- TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning [6.329214318116305]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling.
Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features.
We also designed a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos.
arXiv Detail & Related papers (2024-08-20T09:40:08Z)
- LeRF: Learning Resampling Function for Adaptive and Efficient Image Interpolation [64.34935748707673]
Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors.
We propose a novel method of Learning Resampling (termed LeRF) which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption.
LeRF assigns spatially varying resampling functions to input image pixels and learns to predict the shapes of these resampling functions with a neural network.
arXiv Detail & Related papers (2024-07-13T16:09:45Z)
- TransAxx: Efficient Transformers with Approximate Computing [4.347898144642257]
Vision Transformer (ViT) models have proven to be very competitive and are often a popular alternative to Convolutional Neural Networks (CNNs).
We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic.
Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- ParaFormer: Parallel Attention Transformer for Efficient Feature Matching [8.552303361149612]
This paper proposes a novel parallel attention model entitled ParaFormer.
It fuses features and keypoint positions through the concept of amplitude and phase, and integrates self- and cross-attention in a parallel manner.
Experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that ParaFormer achieves state-of-the-art performance.
The efficient ParaFormer-U variant achieves comparable performance with less than 50% FLOPs of the existing attention-based models.
arXiv Detail & Related papers (2023-03-02T03:29:16Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than 2x improvement on efficiency compared to state-of-the-art vision transformers with only 0.8% drop of accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- An Efficient and Scalable Collection of Fly-inspired Voting Units for Visual Place Recognition in Changing Environments [20.485491385050615]
Low-overhead VPR techniques would enable visual place recognition on platforms equipped with low-end, cheap hardware.
Our goal is to provide an algorithm of extreme compactness and efficiency while achieving state-of-the-art robustness to appearance changes and small point-of-view variations.
arXiv Detail & Related papers (2021-09-22T19:01:20Z)
- Dynamic Network Quantization for Efficient Video Inference [60.109250720206425]
We propose a dynamic network quantization framework, that selects optimal precision for each frame conditioned on the input for efficient video recognition.
We train both networks effectively using standard backpropagation with a loss designed to achieve both competitive performance and resource efficiency.
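As a rough illustration of per-frame precision selection, the sketch below applies uniform quantization whose bit-width is chosen by a toy per-frame policy. The variance-based heuristic and all sizes are assumptions for illustration; the paper's policy is a learned, input-conditioned network:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit-width."""
    if bits >= 32:
        return x
    levels = 2 ** (bits - 1) - 1          # e.g. 7 levels at 4 bits, 127 at 8 bits
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

def dynamic_precision(frames, policy):
    """Quantize each frame at the precision the (hypothetical) policy selects."""
    return [quantize(f, policy(f)) for f in frames]

# Toy stand-in policy: low-variance ("easy") frames get 4 bits, others 8.
policy = lambda f: 4 if f.var() < 1.0 else 8

rng = np.random.default_rng(0)
frames = [rng.normal(scale=s, size=(8, 8)) for s in (0.5, 2.0)]
q = dynamic_precision(frames, policy)
```

Savings come from running most frames through the low-precision path while reserving high precision for hard frames.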
arXiv Detail & Related papers (2021-08-23T20:23:57Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.