Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL
- URL: http://arxiv.org/abs/2601.10117v1
- Date: Thu, 15 Jan 2026 06:53:59 GMT
- Title: Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL
- Authors: Wenwen Liao, Jianbo Yu, Yuansong Wang, Shifu Yan, Xiaofeng Yang,
- Abstract summary: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. Existing VICL methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts.
- Score: 4.215181054941225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Secondly, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from the fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
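To make the two main components concrete, below is a minimal PyTorch sketch of the general ideas described in the abstract: attention-weighted aggregation of several retrieved prompts into one contextual prompt, plus a small per-arrangement residual MLP that injects layout priors. This is not the authors' implementation; the module names (PromptFusion, ArrangementMLP), the scoring function, the number of arrangements, and the residual-adapter placement are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PromptFusion(nn.Module):
    """Toy adaptive fusion: attention-weighted aggregation of several prompt embeddings
    into one contextual prompt (hypothetical design, not the paper's exact module)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, query_emb, prompt_embs):
        # query_emb: (B, D); prompt_embs: (B, K, D) for K retrieved prompts
        sim = self.score(prompt_embs * query_emb.unsqueeze(1))   # (B, K, 1) relevance scores
        weights = torch.softmax(sim, dim=1)                      # relative prompt quality
        return (weights * prompt_embs).sum(dim=1)                # fused contextual prompt (B, D)

class ArrangementMLP(nn.Module):
    """Lightweight per-arrangement adapter that adds a layout prior as a residual
    (illustrative stand-in for the arrangement-specific MLPs)."""
    def __init__(self, dim, num_arrangements=4, hidden=64):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_arrangements)
        )

    def forward(self, fused_prompt, arrangement_id):
        return fused_prompt + self.adapters[arrangement_id](fused_prompt)

# Usage with dummy tensors
B, K, D = 2, 3, 256
fusion, arrange = PromptFusion(D), ArrangementMLP(D)
fused = fusion(torch.randn(B, D), torch.randn(B, K, D))
contextual = arrange(fused, arrangement_id=1)
print(contextual.shape)  # torch.Size([2, 256])
```

The contextual prompt produced this way would then condition the inpainting model; the bidirectional fine-tuning described in the abstract would additionally swap query and prompt roles during training, which is a training-loop choice rather than an architectural one and is omitted here.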
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - Enhancing Visual In-Context Learning by Multi-Faceted Fusion [6.852150407828682]
We introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. Experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization.
arXiv Detail & Related papers (2026-01-15T06:25:09Z) - Distinguishing Visually Similar Actions: Prompt-Guided Semantic Prototype Modulation for Few-Shot Action Recognition [18.527513690285364]
Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples. This paper proposes a CLIP-SPM framework, which includes three components to address the challenges of temporal modeling and visual similarity.
arXiv Detail & Related papers (2025-12-22T05:13:58Z) - QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models [50.51641024244313]
We investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC). We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks.
arXiv Detail & Related papers (2025-11-05T05:49:48Z) - Text-guided Visual Prompt DINO for Generic Segmentation [31.33676182634522]
We propose Prompt-DINO, a text-guided visual Prompt DINO framework. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features. Second, we design order-aligned query selection for DETR-based architectures. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model.
arXiv Detail & Related papers (2025-08-08T09:09:30Z) - Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching [31.42132290162457]
We introduce a new framework called IMD (Image feature Matching with a pre-trained Diffusion model) with two parts. Unlike the dominant solutions that employ contrastive-learning-based foundation models emphasizing global semantics, we integrate generative diffusion models. Our proposed IMD establishes a new state-of-the-art on commonly evaluated benchmarks, and the superior 12% improvement in IMIM indicates our method efficiently mitigates the misalignment.
arXiv Detail & Related papers (2025-07-14T14:28:15Z) - TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models [123.17643568298116]
We present TAViS, a novel framework that couples the knowledge of multimodal foundation models for cross-modal alignment. Effectively combining these models poses two key challenges: the difficulty in transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
arXiv Detail & Related papers (2025-06-13T03:19:47Z) - Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration [109.38288333994407]
Contrastive Prompt Learning (CPL) is a novel framework that fundamentally enhances prompt-task alignment. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
arXiv Detail & Related papers (2025-04-14T08:24:57Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z) - Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization [23.475411831792716]
We propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization.
The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection.
Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2022-08-24T05:18:23Z)