Towards In-Vehicle Multi-Task Facial Attribute Recognition:
Investigating Synthetic Data and Vision Foundation Models
- URL: http://arxiv.org/abs/2403.06088v1
- Date: Sun, 10 Mar 2024 04:17:54 GMT
- Title: Towards In-Vehicle Multi-Task Facial Attribute Recognition:
Investigating Synthetic Data and Vision Foundation Models
- Authors: Esmaeil Seraj and Walter Talamonti
- Abstract summary: We investigate the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of vehicle passengers.
Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context.
- Score: 8.54530542456452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the burgeoning field of intelligent transportation systems, enhancing
vehicle-driver interaction through facial attribute recognition, such as facial
expression, eye gaze, and age, is of paramount importance for safety,
personalization, and overall user experience. However, the scarcity of
comprehensive large-scale, real-world datasets poses a significant challenge
for training robust multi-task models. Existing literature often overlooks the
potential of synthetic datasets and the comparative efficacy of
state-of-the-art vision foundation models in such constrained settings. This
paper addresses these gaps by investigating the utility of synthetic datasets
for training complex multi-task models that recognize facial attributes of
vehicle passengers, such as gaze plane, age, and facial expression.
Utilizing transfer learning techniques with both pre-trained Vision Transformer
(ViT) and Residual Network (ResNet) models, we explore various training and
adaptation methods to optimize performance, particularly when data availability
is limited. We provide extensive post-evaluation analysis, investigating the
effects of synthetic data distributions on model performance for both
in-distribution data and out-of-distribution inference. Our study unveils counter-intuitive
findings, notably the superior performance of ResNet over ViTs in our specific
multi-task context, which is attributed to the mismatch in model complexity
relative to task complexity. Our results highlight the challenges and
opportunities for enhancing the use of synthetic data and vision foundation
models in practical applications.
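As a concrete illustration of the transfer-learning setup the abstract describes, the following PyTorch sketch shares a pre-trained ResNet-50 backbone across separate heads for gaze plane, age, and facial expression. It is a minimal sketch, assuming placeholder class counts and simple linear heads; the paper's actual head designs, training recipe, and the ViT counterpart may differ.

```python
# Minimal multi-task transfer-learning sketch: one shared pre-trained
# backbone, one lightweight head per facial attribute task. Class counts
# are illustrative assumptions, not the authors' configuration.
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskFaceNet(nn.Module):
    def __init__(self, n_gaze_planes=4, n_age_bins=8, n_expressions=7):
        super().__init__()
        # Pre-trained ResNet-50 backbone; drop the ImageNet classifier.
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # Separate heads read the shared feature vector.
        self.gaze_head = nn.Linear(feat_dim, n_gaze_planes)
        self.age_head = nn.Linear(feat_dim, n_age_bins)
        self.expr_head = nn.Linear(feat_dim, n_expressions)

    def forward(self, x):
        feats = self.backbone(x)
        return self.gaze_head(feats), self.age_head(feats), self.expr_head(feats)

model = MultiTaskFaceNet()
gaze, age, expr = model(torch.randn(2, 3, 224, 224))  # per-task logits
```

Training would then minimize a (possibly weighted) sum of per-task losses, and the choice between freezing and fine-tuning the backbone is one of the adaptation decisions the abstract alludes to under limited data.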
Related papers
- Plots Unlock Time-Series Understanding in Multimodal Models [5.792074027074628]
This paper proposes a method that leverages the existing vision encoders of multimodal foundation models to "see" time-series data via plots.
Our empirical evaluations show that this approach outperforms providing the raw time-series data as text.
To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks.
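A minimal sketch of the plotting idea, assuming a simple matplotlib rendering (the plot style and resolution below are assumptions, not the paper's settings): render the raw series to an RGB image that a multimodal model's vision encoder can consume in place of the numbers as text.

```python
# Render a time series to an image for a vision encoder to "see".
import io
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def series_to_image(values, size=(448, 448)):
    fig, ax = plt.subplots(figsize=(4, 4), dpi=112)
    ax.plot(values, linewidth=1.5)
    ax.set_xlabel("time step")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB").resize(size)

img = series_to_image(np.sin(np.linspace(0, 8 * np.pi, 500)))
# `img` can now be passed to a multimodal model instead of raw text numbers.
```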
arXiv Detail & Related papers (2024-10-03T16:23:13Z)
- Synthetic data augmentation for robotic mobility aids to support blind and low vision people [5.024531194389658]
Robotic mobility aids for blind and low-vision (BLV) individuals rely heavily on deep learning-based vision models.
The performance of these models is often constrained by the availability and diversity of real-world datasets.
In this study, we investigate the effectiveness of synthetic data, generated using Unreal Engine 4, for training robust vision models.
arXiv Detail & Related papers (2024-09-17T13:17:28Z)
- A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
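One plausible way to realize background augmentation with off-the-shelf tools (not necessarily the paper's exact pipeline) is diffusion-based inpainting: keep the annotated object, mask everything else, and synthesize a new background. The checkpoint name, file paths, and prompt below are illustrative assumptions.

```python
# Diffusion inpainting as a background-augmentation sketch: white mask
# pixels are regenerated, so inverting the object mask swaps the background.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageOps

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("car.png").convert("RGB").resize((512, 512))
object_mask = Image.open("car_mask.png").convert("L").resize((512, 512))
background_mask = ImageOps.invert(object_mask)  # inpaint all but the object

augmented = pipe(
    prompt="a city street at dusk, photorealistic",
    image=image,
    mask_image=background_mask,
).images[0]
augmented.save("car_new_background.png")
```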
arXiv Detail & Related papers (2024-08-01T07:40:00Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
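The "user token" model referenced here is typically implemented by prepending a dedicated special token per annotator to the input text, so one shared model can condition on who labelled the example. A hedged sketch with Hugging Face transformers follows; the annotator IDs and base checkpoint are assumptions.

```python
# User-token annotator modeling: one special token per annotator.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

annotator_ids = ["ann_0", "ann_1", "ann_2"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"[{a}]" for a in annotator_ids]}
)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

text, annotator = "the service was fine, I guess", "ann_1"
inputs = tokenizer(f"[{annotator}] {text}", return_tensors="pt")
logits = model(**inputs).logits  # prediction conditioned on the annotator token
```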
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
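A rough sketch of the overlap measurement, assuming PCA as the dimensionality-reduction step and a simple centroid-gap statistic (the paper's actual technique and metric may differ):

```python
# Measure synthetic-vs-real overlap in a reduced space (illustrative only).
import numpy as np
from sklearn.decomposition import PCA

real = np.random.rand(500, 64 * 64)       # flattened real eye images (placeholder)
synthetic = np.random.rand(500, 64 * 64)  # flattened synthetic eye images

pca = PCA(n_components=2).fit(np.vstack([real, synthetic]))
real_2d, syn_2d = pca.transform(real), pca.transform(synthetic)

# Crude overlap proxy: distance between domain centroids in the reduced
# space, normalized by the average within-domain spread.
centroid_gap = np.linalg.norm(real_2d.mean(0) - syn_2d.mean(0))
spread = 0.5 * (real_2d.std() + syn_2d.std())
print(f"normalized domain gap: {centroid_gap / spread:.3f}")
```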
arXiv Detail & Related papers (2024-03-23T22:32:06Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
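A basic mechanism for such transfer is distilling one pretrained model's predictions into another on unlabeled data; the sketch below shows only that step, with arbitrary torchvision checkpoints standing in for the model pairing (the paper's full procedure for avoiding degradation is not reproduced here).

```python
# One distillation step from a "teacher" into a "student" pretrained model.
import torch
import torch.nn.functional as F
from torchvision import models

teacher = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.DEFAULT).eval()
student = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)  # stand-in for a transfer batch
with torch.no_grad():
    teacher_logits = teacher(images)

T = 2.0  # distillation temperature
optimizer.zero_grad()
loss = F.kl_div(
    F.log_softmax(student(images) / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
optimizer.step()
```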
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
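Of the three components, ITC is the most standardized; a rough CLIP-style sketch follows (UniDiff's exact formulation, temperature, and embedding sizes are not given here and are assumed).

```python
# Symmetric image-text contrastive loss: matched pairs lie on the diagonal.
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(len(logits))              # index of each pair
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = itc_loss(torch.randn(16, 512), torch.randn(16, 512))
```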
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Facial Emotion Recognition using Deep Residual Networks in Real-World Environments [5.834678345946704]
We propose a facial feature extractor model trained on an in-the-wild and massively collected video dataset.
The dataset consists of a million labelled frames from 2,616 subjects.
As temporal information is important to the emotion recognition domain, we utilise LSTM cells to capture the temporal dynamics in the data.
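A minimal sketch of that CNN-plus-LSTM pattern, assuming a ResNet-50 feature extractor and illustrative sizes (the paper's exact extractor and head are not reproduced here): per-frame features feed an LSTM that models temporal dynamics before emotion classification.

```python
# Per-frame CNN features + LSTM over time for video emotion recognition.
import torch
import torch.nn as nn
from torchvision import models

class TemporalEmotionNet(nn.Module):
    def __init__(self, n_emotions=7, hidden=256):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        feat_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, clips):                 # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])    # predict from the last time step

logits = TemporalEmotionNet()(torch.randn(2, 16, 3, 224, 224))
```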
arXiv Detail & Related papers (2021-11-04T10:08:22Z)
- Deflating Dataset Bias Using Synthetic Data Augmentation [8.509201763744246]
State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning.
The goal of this paper is to investigate the use of targeted synthetic data augmentation for filling gaps in real datasets for vision tasks.
Empirical studies on three different computer vision tasks of practical use to AVs consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance.
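The simplest form of "synthetic data in the training mix" is concatenating real and synthetic datasets into one loader, as in the placeholder sketch below; the paper's targeted gap-filling strategy is more deliberate than this uniform mixing, and the dataset paths are assumptions.

```python
# Mix real and synthetic images into one training stream.
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
real = datasets.ImageFolder("data/real", transform=tfm)
synthetic = datasets.ImageFolder("data/synthetic", transform=tfm)

train_loader = DataLoader(ConcatDataset([real, synthetic]),
                          batch_size=32, shuffle=True)
for images, labels in train_loader:
    ...  # standard supervised training step over the mixed batches
```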
arXiv Detail & Related papers (2020-04-28T21:56:10Z)