Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
- URL: http://arxiv.org/abs/2601.22570v1
- Date: Fri, 30 Jan 2026 05:10:34 GMT
- Title: Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
- Authors: Aditya Sarkar, Yi Li, Jiacheng Cheng, Shlok Mishra, Nuno Vasconcelos
- Abstract summary: This paper considers selective prediction for visual language foundation models. We seek training-free approaches of low complexity, applicable to any foundation model. We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores.
- Score: 40.16419917667614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selective prediction aims to endow predictors with a reject option, to avoid low-confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low complexity that are applicable to any foundation model, and consider methods based on external vision-language model embeddings, such as CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs, and is complemented by contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.
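The two mechanisms the abstract names lend themselves to a short illustration. Below is a minimal NumPy sketch of memory-based embedding smoothing and contrastive score normalization driving a selective image-text matching decision. It is not the authors' implementation: the function names, the choice of k, the negative-text pool, and the abstention threshold are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit sphere, as CLIP-style encoders do."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def memory_smooth(query: np.ndarray, memory: np.ndarray, k: int = 8) -> np.ndarray:
    """Reduce embedding variance by averaging the query embedding with its
    k nearest neighbors retrieved from a memory of reference embeddings."""
    sims = memory @ query                        # cosine similarities (unit norm)
    nn = np.argsort(-sims)[:k]                   # indices of the k nearest entries
    return l2_normalize(np.mean(np.vstack([query[None], memory[nn]]), axis=0))

def contrastive_score(image_emb: np.ndarray, text_emb: np.ndarray,
                      negatives: np.ndarray) -> float:
    """Calibrate a raw image-text similarity by contrasting it against a pool
    of negative texts (softmax over candidates), so that scores become
    comparable across examples."""
    logits = np.concatenate([[image_emb @ text_emb], negatives @ image_emb])
    probs = np.exp(logits - logits.max())
    return float(probs[0] / probs.sum())

# Toy demo with random stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
d = 512
img = l2_normalize(rng.normal(size=d))                 # query image embedding
txt = l2_normalize(rng.normal(size=d))                 # candidate caption embedding
img_memory = l2_normalize(rng.normal(size=(1000, d)))  # retrieval memory
neg_texts = l2_normalize(rng.normal(size=(64, d)))     # negative caption pool

img_smoothed = memory_smooth(img, img_memory, k=8)
conf = contrastive_score(img_smoothed, txt, neg_texts)
prediction = txt if conf >= 0.5 else None              # None means "say no" (abstain)
```

The abstention threshold trades coverage against risk: raising it rejects more predictions but makes the accepted ones more reliable, which is the usual selective prediction trade-off.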
Related papers
- Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning [33.269644831847636]
Image-Adaptive Prompt Learning (IAPL) is a novel paradigm that adjusts the prompts according to each input image, rather than fixing them after training. IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets.
arXiv Detail & Related papers (2025-08-03T05:41:24Z) - Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning [36.25732435294088]
Two-view correspondence learning aims to discern true and false correspondences between image pairs. Inspired by Mamba's inherent selectivity, we propose CorrMamba, a correspondence filter. Our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°.
arXiv Detail & Related papers (2025-03-23T04:44:21Z) - Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z) - Dual Caption Preference Optimization for Diffusion Models [53.218293277964165]
We introduce Dual Caption Preference Optimization (DCPO) to improve text-to-image diffusion models. DCPO assigns two distinct captions to each preference pair, which reinforces the learning signal. Experiments show that DCPO significantly improves image quality and relevance to prompts.
arXiv Detail & Related papers (2025-02-09T20:34:43Z) - How Language Models Prioritize Contextual Grammatical Cues? [3.9790222241649587]
We investigate how language models handle gender agreement when multiple gender cue words are present.
Our findings reveal striking differences in how encoder-based and decoder-based models prioritize and use contextual information for their predictions.
arXiv Detail & Related papers (2024-10-04T14:09:05Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames, which makes full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z) - Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that this approach suffers from multiple limitations including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z) - Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization [26.63463867095924]
We propose a novel weakly-supervised temporal action localization (WTAL) approach named Convex Combination Consistency between Neighbors (C³BN).
C³BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets, and a macro-micro consistency regularization (a minimal sketch of the micro augmentation appears after this list).
Experimental results demonstrate the effectiveness of C³BN on top of various baselines for WTAL with video-level and point-level supervisions.
arXiv Detail & Related papers (2022-05-01T05:30:53Z) - DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models (see the score-map sketch after this list).
Our method is model-agnostic and can be applied to arbitrary dense prediction systems with various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z) - Self-Supervised Tuning for Few-Shot Segmentation [82.32143982269892]
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples.
Existing meta-learning methods tend to fail to generate category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space.
This paper presents an adaptive tuning framework, in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)
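The C³BN entry above mentions a micro augmentation built from convex combinations of adjacent snippets; the sketch below illustrates one plausible form of that idea together with a matching consistency term. This is an assumed reading, not the paper's code: the feature shapes, the stand-in scorer, and the loss form are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 32, 256                        # snippets per video, feature dimension
snippets = rng.normal(size=(T, D))    # stand-in snippet features

# Micro augmentation (assumed form): synthesize virtual snippets as convex
# combinations of temporally adjacent features, filling the space
# "in-between" neighbors.
lam = rng.uniform(size=(T - 1, 1))    # one mixing weight per adjacent pair
virtual = lam * snippets[:-1] + (1 - lam) * snippets[1:]

# Stand-in nonlinear snippet scorer; in practice this would be the WTAL head.
W1 = rng.normal(size=(D, 64))
W2 = rng.normal(size=(64, 1))
def scores(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W1) @ W2

# Consistency regularization (assumed form): the score of a virtual snippet
# should match the same convex combination of its real neighbors' scores.
s_real = scores(snippets)
target = lam * s_real[:-1] + (1 - lam) * s_real[1:]
consistency_loss = float(np.mean((scores(virtual) - target) ** 2))
```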
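The DenseCLIP entry above turns image-text matching into pixel-text matching; the following sketch shows what a pixel-text score map is in that setting. The dense features and class prompts are random stand-ins, and the temperature value is an assumption in the spirit of CLIP's learned logit scale.

```python
import numpy as np

def l2n(x: np.ndarray) -> np.ndarray:
    """Unit-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H, W, D, num_classes = 14, 14, 512, 20
pixel_feats = l2n(rng.normal(size=(H, W, D)))       # stand-in dense image features
text_embs = l2n(rng.normal(size=(num_classes, D)))  # stand-in class-prompt embeddings

# Pixel-text score maps: cosine similarity between every spatial feature and
# every class prompt, giving an (H, W, num_classes) tensor that can supervise
# or guide a dense prediction head.
score_maps = pixel_feats @ text_embs.T
seg_logits = score_maps / 0.07                      # assumed CLIP-style temperature
seg_pred = seg_logits.argmax(axis=-1)               # (H, W) per-pixel class map
```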