Related papers: T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

URL: http://arxiv.org/abs/2511.16107v1
Date: Thu, 20 Nov 2025 07:02:06 GMT
Title: T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Authors: Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
Abstract summary: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs).
Score: 15.649508617993538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and we construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

Related papers

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds [56.749972238005604]
This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos. TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation.
arXiv Detail & Related papers (2025-11-23T09:12:48Z)
4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
Test-Time Visual In-Context Tuning [85.62916644835902]
Visual in-context learning (VICL) allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. We propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample.
arXiv Detail & Related papers (2025-03-27T17:59:52Z)
Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa). The learnable external layer is built upon valid embeddings of pre-trained CLIP. Four experiments demonstrate that our method outperforms the existing prompt learning method.
arXiv Detail & Related papers (2024-07-29T03:30:09Z)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM). It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z)
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? [158.96530466189986]
Multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. We investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We train v-MLLM, a generalizable model that is capable of robust instruction following with both text-modality and visual-modality instructions.
arXiv Detail & Related papers (2023-11-29T14:08:53Z)
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models. We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment. We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language. We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language. We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data. We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.