Fugu-MT Paper Translation (Abstract): T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Paper overview: T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

arxiv url: http://arxiv.org/abs/2511.16107v1
Date: Thu, 20 Nov 2025 07:02:06 GMT
Status: Translation complete
System update date: 2025-11-21 17:08:52.509383
Title: T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Title (reference translation): T2T-VICL: How to Unlock the Boundaries of Cross-Task Visual In-Context Learning with Text-Driven VLMs
Authors: Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
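Abstract summary: In large language models (LLMs), in-context learning (ICL) performs new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) show promising capabilities for solving downstream tasks with unified vision-language models (VLMs).
Reference score (independently calculated attention level): 15.649508617993538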
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e., T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
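Abstract (reference translation): In large language models (LLMs), in-context learning (ICL) performs new tasks by conditioning on small demonstrations provided in the input context. Recent visual in-context learning (VICL) has shown promising capabilities for solving downstream tasks with vision-language models (VLMs). When the visual prompt and the target image come from different visual tasks, can VLMs still enable VICL? In this paper, we propose T2T-VICL, a fully collaborative pipeline, to examine the potential of cross-task VICL in VLMs. At its core, we design a mechanism that generates and selects text prompts implicitly describing the differences between two distinct low-level vision tasks, and we construct the first cross-task VICL dataset. On this basis, we propose a new inference framework that combines perceptual score-based reasoning with conventional evaluation metrics to perform cross-task VICL. The proposed method achieves top-tier results in nine cross-task scenarios and second-tier performance in ten additional scenarios, opening up the boundaries of cross-task VICL within VLMs.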

Related papers