VIOLA: Towards Video In-Context Learning with Minimal Annotations
- URL: http://arxiv.org/abs/2601.15549v1
- Date: Thu, 22 Jan 2026 00:35:30 GMT
- Title: VIOLA: Towards Video In-Context Learning with Minimal Annotations
- Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
- Abstract summary: We introduce VIOLA, a framework that synergizes minimal expert supervision with abundant unlabeled data. Our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
- Score: 20.810620293371027
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments such as industrial or surgical settings because they require expert annotation. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
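The two mechanisms named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formulas, so the kNN density proxy, the product `density * uncertainty`, the greedy distance-based diversity term, the weighted-sum composite score, and every name and parameter (`select_for_annotation`, `retrieve_demonstrations`, `k`, `alpha`) are assumptions made for this example.

```python
import numpy as np


def select_for_annotation(features, uncertainty, budget, k=2):
    """Greedily pick up to `budget` indices that are dense, uncertain, and spread out.

    Assumed scoring (hypothetical, not from the paper):
    score = kNN density * model uncertainty * normalized distance to chosen set.
    """
    n = len(features)
    budget = min(budget, n)
    # Pairwise Euclidean distances between all samples.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # kNN density proxy: small mean distance to the k nearest neighbours
    # gives a value near 1, isolated outliers get a value near 0.
    # Column 0 of each sorted row is the zero self-distance, so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = 1.0 / (1.0 + knn.mean(axis=1))
    base = density * uncertainty

    taken = np.zeros(n, dtype=bool)
    min_dist = np.full(n, np.inf)  # distance to the nearest chosen sample
    chosen = []
    for _ in range(budget):
        if chosen:
            # Diversity term: favour samples far from everything chosen so far.
            spread = min_dist / max(min_dist.max(), 1e-12)
        else:
            spread = np.ones(n)
        score = np.where(taken, -np.inf, base * spread)
        pick = int(np.argmax(score))
        chosen.append(pick)
        taken[pick] = True
        min_dist = np.minimum(min_dist, dist[pick])
    return chosen


def retrieve_demonstrations(query, pool_features, confidence, top_k=2, alpha=0.7):
    """Rank pool items by an assumed composite of similarity and label confidence.

    `confidence` would be 1.0 for expert-verified labels and lower for
    pseudo-labels; `alpha` trades similarity against confidence.
    """
    q = query / np.linalg.norm(query)
    p = pool_features / np.linalg.norm(pool_features, axis=1, keepdims=True)
    similarity = p @ q  # cosine similarity of each pool item to the query
    score = alpha * similarity + (1.0 - alpha) * confidence
    order = np.argsort(-score)[:top_k]
    return [int(i) for i in order]
```

With `alpha=1.0` the retrieval reduces to plain similarity search. Lowering `alpha` lets an expert-verified demonstration outrank a slightly more similar pseudo-labeled one, which is the behaviour the abstract attributes to confidence-aware retrieval.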
Related papers
- Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings. We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
- Optimal Learning from Label Proportions with General Loss Functions [33.827617632719864]
We introduce a novel and versatile low-variance de-biasing methodology to learn from aggregate label information. Our approach exhibits remarkable flexibility, seamlessly accommodating a broad spectrum of practically relevant loss functions. We empirically validate the efficacy of our proposed approach across a diverse array of benchmark datasets.
arXiv Detail & Related papers (2025-09-18T16:53:32Z)
- Feeding LLM Annotations to BERT Classifiers at Your Own Risk [14.533304890042361]
Using LLM-generated labels to fine-tune smaller encoder-only models for text classification has gained popularity in various settings. We demonstrate how the perennial curse of training on synthetic data manifests itself in this specific setup. Compared to models trained on gold labels, we observe not only the expected performance degradation in accuracy and F1 score, but also increased instability across training runs and premature performance plateaus.
arXiv Detail & Related papers (2025-04-21T20:54:55Z)
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations [17.832223891973353]
Sequence labeling remains a significant challenge in low-resource, domain-specific scenarios. We propose a novel framework that combines an LLM-based knowledge enhancement workflow with a span-based Knowledge Fusion for Rich and Efficient Extraction model. Our approach achieves state-of-the-art performance, effectively addressing the challenges posed by low-resource settings.
arXiv Detail & Related papers (2025-01-31T12:39:28Z)
- An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification [2.0930389307057427]
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios.
We propose Diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens.
arXiv Detail & Related papers (2024-09-05T02:51:28Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z)
- Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels [26.542718087103665]
SemiMatch is a semi-supervised solution for establishing dense correspondences across semantically similar images.
Our framework generates the pseudo-labels using the model's prediction itself between source and weakly-augmented target, and uses pseudo-labels to learn the model again between source and strongly-augmented target.
In experiments, SemiMatch achieves state-of-the-art performance on various benchmarks, especially on PF-Willow by a large margin.
arXiv Detail & Related papers (2022-03-30T03:52:50Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)