Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
- URL: http://arxiv.org/abs/2510.11295v2
- Date: Thu, 30 Oct 2025 20:31:38 GMT
- Title: Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
- Authors: Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl,
- Abstract summary: HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set.<n>Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated.
- Score: 50.6117007117789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5\% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
Related papers
- ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking [44.58919028628059]
Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming.<n>Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality.<n>To address this problem, we propose the Supervise with Critical Thinking (ACT) data pipeline.
arXiv Detail & Related papers (2025-11-13T00:32:30Z) - Fine-tuning can Help Detect Pretraining Data from Large Language Models [7.7209640786782385]
Current methods differentiate members and non-members by designing scoring functions, like Perplexity and Min-k%.<n>We introduce a novel and effective method termed Fine-tuned Score Deviation(FSD), which improves the performance of current scoring functions for pretraining data detection.<n>In particular, we propose to measure the deviation distance of current scores after fine-tuning on a small amount of unseen data within the same domain.
arXiv Detail & Related papers (2024-10-09T15:36:42Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models(LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$EM$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - Effects of Human Adversarial and Affable Samples on BERT Generalization [12.000570944219515]
We examine the impact of training data quality, not quantity, on a model's generalizability.
We find that for a fixed size of training samples, as a rule of thumb, having 10-30% h-adversarial instances improves the precision.
arXiv Detail & Related papers (2023-10-12T03:20:43Z) - On the Connection between Pre-training Data Diversity and Fine-tuning
Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER)
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - A semi-supervised Teacher-Student framework for surgical tool detection
and localization [2.41710192205034]
We introduce a semi-supervised learning (SSL) framework in surgical tool detection paradigm.
In the proposed work, we train a model with labeled data which initialises the Teacher-Student joint learning.
Our results on m2cai16-tool-locations dataset indicate the superiority of our approach on different supervised data settings.
arXiv Detail & Related papers (2022-08-21T17:21:31Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.