Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
- URL: http://arxiv.org/abs/2508.13524v1
- Date: Tue, 19 Aug 2025 05:33:10 GMT
- Title: Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
- Authors: Vamsi Krishna Mulukutla, Sai Supriya Pavarala, Srinivasa Raju Rudraraju, Sridevi Bonthu
- Abstract summary: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs) against traditional deep learning models. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
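The restore-then-evaluate flow described in the abstract can be illustrated with a short sketch. This is a minimal, hedged example assuming the open-source gfpgan package's GFPGANer interface, an assumed local checkpoint path, and a classifier exposing a hypothetical predict() method; it is not the authors' exact pipeline.

```python
# Minimal sketch: GFPGAN-based restoration followed by FER evaluation with the
# metrics named in the abstract (precision, recall, F1-score, accuracy).
import cv2
import numpy as np
from gfpgan import GFPGANer          # open-source GFPGAN restoration package
from sklearn.metrics import classification_report

# FER-2013's seven emotion classes, in the conventional 0-6 label order.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",  # assumed local checkpoint path
    upscale=2,                    # upsample the low-resolution 48x48 crops
    arch="clean",
    channel_multiplier=2,
)

def restore(gray_face: np.ndarray) -> np.ndarray:
    """Convert a grayscale FER-2013 face to BGR and run GFPGAN restoration."""
    bgr = cv2.cvtColor(gray_face, cv2.COLOR_GRAY2BGR)
    _, _, restored = restorer.enhance(
        bgr, has_aligned=False, only_center_face=True, paste_back=True
    )
    return restored

def evaluate(classifier, images, labels):
    """Report per-class precision, recall, F1-score, and overall accuracy."""
    # classifier.predict() is a hypothetical stand-in for any evaluated model
    # (EfficientNet-B0, ResNet-50, VGG19, CLIP, or Phi-3.5 Vision).
    preds = [classifier.predict(restore(img)) for img in images]
    print(classification_report(labels, preds, target_names=EMOTIONS))
```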
Related papers
- Facial Emotion Recognition on FER-2013 using an EfficientNetB2-Based Approach [0.0]
Detection of human emotions based on facial images in real-world scenarios is a difficult task due to low image quality, variations in lighting, pose changes, background distractions, small inter-class variations, noisy crowd-sourced labels, and severe class imbalance. We address these challenges using a lightweight and efficient facial emotion recognition pipeline based on EfficientNetB2. The model is trained using a stratified 87.5%/12.5% train-validation split while keeping the official test set intact, achieving a test accuracy of 68.78% with nearly ten times fewer parameters than VGG16-based baselines (a split of this kind is sketched after this entry).
arXiv Detail & Related papers (2026-01-26T07:29:50Z)
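As referenced in the summary above, a stratified 87.5%/12.5% split can be reproduced with scikit-learn. This is a minimal sketch under stated assumptions: the arrays are synthetic placeholders rather than FER-2013 data, and the seed is illustrative.

```python
# Hedged sketch of an 87.5%/12.5% stratified train/validation split; the
# official test set is left untouched. Arrays below are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(700, 48, 48)    # placeholder for FER-2013 training images
labels = np.repeat(np.arange(7), 100)   # placeholder labels for 7 emotion classes

X_train, X_val, y_train, y_val = train_test_split(
    images, labels,
    test_size=0.125,   # 12.5% validation share
    stratify=labels,   # preserve class proportions despite class imbalance
    random_state=42,   # assumed seed for reproducibility
)
```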
- VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations [40.10675156646689]
Multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images. However, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. We propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment.
arXiv Detail & Related papers (2025-10-25T17:31:02Z)
- HoneyBee: Data Recipes for Vision-Language Reasoners [90.83745691506329]
We introduce several data curation approaches and study their impacts on vision-language models (VLMs). We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs.
arXiv Detail & Related papers (2025-10-14T07:23:44Z)
- EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving [61.99289768925256]
EvaLearn is a benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks. We benchmark nine frontier models and observe varied performance profiles. We find that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks.
arXiv Detail & Related papers (2025-06-03T09:18:33Z)
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.54872845368151]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z)
- Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [92.99416966226724]
We introduce the Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms. We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows DeFo's significance in improving vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)