Demographic User Modeling for Social Robotics with Multimodal Pre-trained Models
- URL: http://arxiv.org/abs/2502.10642v1
- Date: Sat, 15 Feb 2025 02:38:58 GMT
- Title: Demographic User Modeling for Social Robotics with Multimodal Pre-trained Models
- Authors: Hamed Rahimi, Mouad Abrini, Mahdi Khoramshahi, Mohamed Chetouani
- Abstract summary: We introduce two datasets specifically curated to represent demographic characteristics from user facial images. We evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets. To address its observed limitations, we propose adopting a masked image modeling strategy to improve generalization and better capture subtle demographic attributes.
- Score: 4.2185937778110825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the performance of multimodal pre-trained models in user profiling tasks based on visual-linguistic demographic data. These models are critical for adapting to the needs and preferences of human users in social robotics, thereby providing personalized responses and enhancing interaction quality. First, we introduce two datasets specifically curated to represent demographic characteristics derived from user facial images. Next, we evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets, both in its out-of-the-box state and after fine-tuning. Initial results indicate that CLIP performs suboptimally in matching images to demographic descriptions without fine-tuning. Although fine-tuning significantly enhances its predictive capacity, the model continues to exhibit limitations in effectively generalizing subtle demographic nuances. To address this, we propose adopting a masked image modeling strategy to improve generalization and better capture subtle demographic attributes. This approach offers a pathway for enhancing demographic sensitivity in multimodal user modeling tasks.
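The zero-shot evaluation described in the abstract reduces to scoring each facial image against a set of demographic text prompts by embedding similarity. A minimal sketch of that scoring step, using toy vectors in place of real CLIP image/text encoders (the prompts, dimensions, and temperature are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def zero_shot_match(image_emb, text_embs, labels, temperature=100.0):
    """CLIP-style zero-shot matching: cosine similarity between one image
    embedding and each text-prompt embedding, scaled and softmaxed."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)  # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Toy random vectors stand in for CLIP's encoders; the demographic
# prompts below are hypothetical, not from the paper's datasets.
labels = ["a photo of a young adult", "a photo of an older adult"]
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(2, 8))
image_emb = text_embs[1] + 0.1 * rng.normal(size=8)  # near the second prompt

label, probs = zero_shot_match(image_emb, text_embs, labels)
print(label, probs.round(3))
```

In a real evaluation the embeddings would come from CLIP's image and text towers, but the matching rule itself is exactly this nearest-prompt-by-cosine scoring.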
Related papers
- LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains. We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge. Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z) - D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models [4.56877715768796]
We propose a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
arXiv Detail & Related papers (2025-12-10T20:41:29Z) - PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins [20.77710199900999]
We introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we benchmark PersonaTwin against standard LLM outputs. Experimental results show that our framework produces simulation fidelity on par with settings.
arXiv Detail & Related papers (2025-07-30T04:57:30Z) - Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features [54.63343151319368]
This paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim.
arXiv Detail & Related papers (2025-06-24T15:40:11Z) - Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way. Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z) - You Only Submit One Image to Find the Most Suitable Generative Model [48.67303250592189]
We propose a novel setting called Generative Model Identification (GMI). GMI aims to enable the user to identify the most appropriate generative model(s) for the user's requirements efficiently.
arXiv Detail & Related papers (2024-12-16T14:46:57Z) - Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models [1.9890559505377343]
We introduce a new method for generating such data by integrating human keypoints with traditional visual features like captions and bounding boxes.
Our approach produces datasets designed for fine-tuning models to excel in human-centric activities.
Experimental results show an overall improvement of 21.18% compared to the original LLaVA-7B model.
arXiv Detail & Related papers (2024-09-14T05:07:57Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - TUNI: A Textual Unimodal Detector for Identity Inference in CLIP Models [12.497110441765274]
Existing methods for identity inference in CLIP models require querying the model with full PII.
Applying images may risk exposing personal information to target models, as the image might not have been previously encountered by the target model.
We propose a textual unimodal detector (TUNI) in CLIP models, a novel technique for identity inference that: 1) only utilizes text data to query the target model; and 2) eliminates the need for training shadow models.
arXiv Detail & Related papers (2024-05-23T12:54:25Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z) - Asymptotically Fair Participation in Machine Learning Models: an Optimal Control Perspective [21.962258178900065]
The performance of state-of-the-art machine learning models often deteriorates when testing on demographics that are under-represented in the training dataset.
We aim to address the problem of achieving asymptotically fair participation via an optimal control formulation.
We apply an efficient implementation of Pontryagin's maximum principle to estimate the optimal control solution.
arXiv Detail & Related papers (2023-11-16T22:28:38Z) - Zero-shot racially balanced dataset generation using an existing biased StyleGAN2 [5.463417677777276]
We propose a methodology that leverages the biased generative model StyleGAN2 to create demographically diverse images of synthetic individuals.
By training face recognition models with the resulting balanced dataset containing 50,000 identities per race, we can improve their performance and minimize biases that might have been present in a model trained on a real dataset.
arXiv Detail & Related papers (2023-05-12T18:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.