Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
- URL: http://arxiv.org/abs/2601.22161v2
- Date: Mon, 02 Feb 2026 11:50:11 GMT
- Title: Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
- Authors: Anmol Guragain
- Abstract summary: We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Experiments show that sophisticated attention mechanisms consistently underperform on small datasets.
- Score: 0.2538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets: M2 models scored 5 to 13 percentage points below baselines due to overfitting and the destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper's ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.
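The abstract attributes its gains to feature engineering rather than new attention. Below is a minimal sketch of the two feature types it names, delta MFCCs for audio and frequency-domain (band-power) features for EEG, assuming standard librosa/scipy calls; the sampling rates, number of MFCCs, and band edges are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of the domain features named in the abstract: delta MFCCs for
# the audio branch and band-power features for EEG. librosa/scipy calls are
# standard; n_mfcc, band edges, and sampling rates are illustrative assumptions.
import numpy as np
import librosa
from scipy.signal import welch
from scipy.integrate import trapezoid

def audio_mfcc_with_deltas(y, sr=16000, n_mfcc=13):
    """Stack MFCCs with first- and second-order deltas along the feature axis."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta1 = librosa.feature.delta(mfcc)           # first-order temporal derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)
    return np.concatenate([mfcc, delta1, delta2], axis=0)  # (3 * n_mfcc, frames)

# Canonical EEG bands in Hz (assumed; the paper's exact bands may differ).
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def eeg_band_powers(eeg, fs=500):
    """Per-channel band power from a Welch PSD estimate; eeg is (channels, samples)."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs), axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(trapezoid(psd[..., mask], freqs[mask], axis=-1))
    return np.stack(feats, axis=-1)                # (channels, n_bands)
```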
Related papers
- Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets [0.0]
This study explores human action recognition using a three-class subset of the COCO image corpus. The binary Vision Transformer (ViT) achieved 90% mean test accuracy.
arXiv Detail & Related papers (2025-06-13T11:16:50Z)
- Efficient Leaf Disease Classification and Segmentation using Midpoint Normalization Technique and Attention Mechanism [0.0]
We introduce a transformative two-stage methodology, Mid Point Normalization (MPN), for intelligent image preprocessing. Our classification pipeline achieves 93% accuracy while maintaining exceptional class-wise balance. For segmentation tasks, we seamlessly integrate identical attention blocks within the U-Net architecture using MPN-enhanced inputs.
arXiv Detail & Related papers (2025-05-27T15:14:04Z)
- An Enhancement of CNN Algorithm for Rice Leaf Disease Image Classification in Mobile Applications [0.0]
This study focuses on enhancing rice leaf disease image classification algorithms, which have traditionally relied on Convolutional Neural Network (CNN) models. We employed transfer learning with MobileViTV2_050 using ImageNet-1k weights, a lightweight model that integrates CNN's local feature extraction with Vision Transformers' global context learning (see the sketch after this entry). Our approach resulted in a significant 15.66% improvement in classification accuracy for MobileViTV2_050-A, our first enhanced model trained on the baseline dataset, achieving 93.14%.
arXiv Detail & Related papers (2024-12-10T04:41:10Z)
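A minimal sketch of the transfer-learning recipe summarized in the entry above, assuming the timm implementation of MobileViTV2-0.5 with ImageNet-1k weights; the class count, optimizer, and learning rate are illustrative assumptions, not the paper's training setup.

```python
# Hedged sketch: fine-tune ImageNet-1k pretrained MobileViTV2-0.5 (via timm)
# on a small image-classification dataset. num_classes, optimizer, and lr are
# illustrative assumptions, not the paper's reported configuration.
import timm
import torch
from torch import nn, optim

num_classes = 4  # assumption: number of leaf-disease categories
model = timm.create_model("mobilevitv2_050", pretrained=True, num_classes=num_classes)

optimizer = optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised fine-tuning step on a batch of images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```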
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- One-Shot Learning for Periocular Recognition: Exploring the Effect of Domain Adaptation and Data Bias on Deep Representations [59.17685450892182]
We investigate the behavior of deep representations in widely used CNN models under extreme data scarcity for One-Shot periocular recognition.
We improve on state-of-the-art results obtained with networks trained on biometric datasets containing millions of images.
Traditional algorithms like SIFT can outperform CNNs in situations with limited data.
arXiv Detail & Related papers (2023-07-11T09:10:16Z)
- Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavyweight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to reduce the cost of transformers, and combine it with efficient mobile CNNs to form a novel lightweight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm [111.17100512647619]
This paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA).
We propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block.
Extensive quantitative and qualitative experiments on image classification and downstream tasks, along with explanatory experiments, demonstrate the effectiveness and superiority of our approach.
arXiv Detail & Related papers (2022-06-19T04:49:35Z)
- ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task Approach [14.725717500450623]
We propose VADEC, a framework that exploits the correlation between the categorical and dimensional models of emotion representation for better subjectivity analysis.
We jointly train multi-label emotion classification and multi-dimensional emotion regression, thereby exploiting the inter-relatedness between the tasks (a minimal joint-loss sketch follows this list).
arXiv Detail & Related papers (2021-05-09T18:07:04Z)
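Referenced from the VADEC entry above: a minimal sketch of one way to combine multi-label emotion classification with dimensional (e.g. valence/arousal/dominance) regression over a shared representation. The head sizes and fixed loss weighting are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of a joint multi-task objective in the spirit of the VADEC
# summary: a multi-label classification head plus a dimensional regression head
# over a shared pooled representation. Dimensions and weighting are assumptions.
import torch
from torch import nn

class JointEmotionHeads(nn.Module):
    def __init__(self, encoder_dim=768, n_emotions=11, n_dims=3):
        super().__init__()
        self.cls_head = nn.Linear(encoder_dim, n_emotions)  # multi-label logits
        self.reg_head = nn.Linear(encoder_dim, n_dims)       # e.g. V/A/D scores

    def forward(self, pooled: torch.Tensor):
        return self.cls_head(pooled), self.reg_head(pooled)

def joint_loss(cls_logits, cls_targets, reg_preds, reg_targets, alpha=0.5):
    """Fixed-weight sum of multi-label BCE and dimensional regression MSE."""
    bce = nn.functional.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    mse = nn.functional.mse_loss(reg_preds, reg_targets)
    return alpha * bce + (1.0 - alpha) * mse
```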
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.