PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning
- URL: http://arxiv.org/abs/2509.21926v1
- Date: Fri, 26 Sep 2025 06:13:40 GMT
- Title: PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning
- Authors: Jiahao Zhang, Bowen Wang, Hong Liu, Yuta Nakashima, Hajime Nagahara
- Abstract summary: Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs.
- Score: 38.70994263844236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.
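The abstract does not spell out how the patch-level assignment scores are computed or smoothed, so the snippet below is only a minimal NumPy sketch of the general idea rather than the paper's implementation: query patch features are softly matched against the input patches of several in-context pairs, and the resulting scores are pooled across pairs so that no single example dominates the prediction. The function name, the softmax temperature `tau`, and the assumption that patch features are already extracted by a frozen VICL backbone are all placeholders introduced for illustration.

```python
import numpy as np

def panicl_style_prediction(query_patches, context_pairs, k=4, tau=0.1):
    """Sketch of patch-based k-NN prediction pooled over multiple
    in-context pairs (illustrative, not the authors' exact algorithm).

    query_patches : (Q, D) patch features of the query image.
    context_pairs : list of (input_patches, output_patches) tuples,
                    shaped (P, D) and (P, C), one per in-context example.
    """
    context_pairs = context_pairs[:k]          # use k pairs instead of one
    preds = np.zeros((query_patches.shape[0], context_pairs[0][1].shape[1]))

    # normalise query patch features once for cosine similarity
    q = query_patches / np.linalg.norm(query_patches, axis=1, keepdims=True)

    for in_feats, out_feats in context_pairs:
        c = in_feats / np.linalg.norm(in_feats, axis=1, keepdims=True)
        sim = q @ c.T                           # (Q, P) patch-level assignment scores
        attn = np.exp(sim / tau)
        attn /= attn.sum(axis=1, keepdims=True) # soft k-NN assignment within this pair
        preds += attn @ out_feats               # gather the matched output patches

    # Averaging over pairs smooths the assignment scores so that no
    # single in-context example dominates the final prediction.
    return preds / len(context_pairs)
```

In practice the patch features would come from the frozen VICL backbone (e.g., Painter, SegGPT, or LVM activations), and the pooled patch-level prediction would be decoded back into an output image; those steps are omitted here.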
Related papers
- CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment [28.2773807732662]
Large-scale natural image-text datasets often suffer from loose semantic alignment due to weak supervision. We propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures. Two shared robustness pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning.
arXiv Detail & Related papers (2025-08-08T16:23:05Z)
- VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [16.90061119174727]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z)
- Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning [10.630297877530614]
We propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning.
Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast by the correspondences to learn more general unsupervised representations.
Our method significantly outperforms existing state-of-the-art methods on a wide range of downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection.
arXiv Detail & Related papers (2024-07-02T07:35:21Z)
- Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations [0.033483662989441935]
Large vision-language contrastive models (VLCMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications (a generic sketch of such a probe is given after this list).
arXiv Detail & Related papers (2024-05-22T22:03:11Z)
- Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
- Weakly Supervised Contrastive Learning [68.47096022526927]
We introduce a weakly supervised contrastive learning framework (WCL) to tackle this issue.
WCL achieves 65% and 72% ImageNet Top-1 Accuracy using ResNet50, which is even higher than SimCLRv2 with ResNet101.
arXiv Detail & Related papers (2021-10-10T12:03:52Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning [83.48587570246231]
Visual Similarity plays an important role in many computer vision applications.
Deep metric learning (DML) is a powerful framework for learning such similarities.
We propose and study multiple complementary learning tasks, targeting conceptually different data relationships.
We learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance.
arXiv Detail & Related papers (2020-04-28T12:26:50Z)
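As a side note on the Refining Skewed Perceptions entry above, the "simple linear probe" it mentions is a standard recipe: a single linear classifier trained on frozen CLIP image embeddings. The sketch below is a generic illustration under stated assumptions, not that paper's exact protocol; the feature files and hyperparameters are placeholders, and the embeddings are assumed to have been extracted beforehand with a frozen CLIP image encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder files holding pre-extracted frozen CLIP image embeddings
# and task labels (the extraction step itself is not shown).
train_feats = np.load("clip_train_feats.npy")   # (N, D)
train_labels = np.load("train_labels.npy")      # (N,)
test_feats = np.load("clip_test_feats.npy")
test_labels = np.load("test_labels.npy")

# A single linear layer on top of frozen features: its weights indicate
# which embedding directions carry task-specific signal.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)

print("linear probe accuracy:", probe.score(test_feats, test_labels))
```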
This list is automatically generated from the titles and abstracts of the papers on this site.