Fugu-MT paper translation (abstract): From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Paper abstract: From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

arxiv url: http://arxiv.org/abs/2604.06748v1
Date: Wed, 08 Apr 2026 07:13:56 GMT
Status: translation complete
Last updated in system: 2026-04-09 17:30:51.386676
Title: From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks
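Title (reference translation): From Static to Interactive: Applying Visual In-Context Learners to User-Driven Tasks
Authors: Carlos Schmidt, Simon Reiß
Abstract summary: We transform static visual in-context learners into user-driven systems, namely Interactive DeLVM. This work bridges the gap between static task adaptation and fluid interactivity in user-centric visual in-context learning.
Reference score (independently computed attention score): 5.208702297063032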
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
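Abstract (reference translation): Visual in-context learning models are designed to adapt to new tasks by leveraging a set of input-output pairs, achieving rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: they can adapt to new tasks, but they lack a mechanism for incorporating user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is especially restrictive in real-world applications, where users want to actively guide model predictions, for example by highlighting the target object for segmentation, indicating a region that should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method for transforming static visual in-context learners, in particular the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM. Specifically, by encoding interactions directly into the input-output pairs, we keep the philosophy of visual in-context learning intact. Our experiments show that SOTA visual in-context learning models cannot use interaction cues effectively and often ignore user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. This bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.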
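
The abstract describes encoding user interactions directly into the example input-output pairs and reports gains in IoU and PSNR. The following is a minimal illustrative sketch of that general idea in plain Python, not the authors' implementation: the helper names `draw_click`, `build_prompt`, `iou`, and `psnr`, the square red marker, and the list-based image representation are all assumptions made for illustration. No model is called; the sketch only shows how a click might be painted onto images before they are arranged into an in-context prompt, and how the two metrics are conventionally computed.

```python
import math


def draw_click(image, row, col, radius=1, color=(255, 0, 0)):
    """Return a copy of an RGB image (rows of (r, g, b) tuples) with a
    filled square marker centred on the clicked pixel.

    Hypothetical helper: the marker shape and colour are assumptions.
    """
    height, width = len(image), len(image[0])
    marked = [list(pixels) for pixels in image]  # copy rows, keep input intact
    for r in range(max(0, row - radius), min(height, row + radius + 1)):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            marked[r][c] = color
    return marked


def build_prompt(examples, query_image, query_click):
    """Arrange (image, click, target) examples and a query into one sequence.

    Each example contributes its click-annotated input followed by its target;
    the click-annotated query comes last. A real in-context learner would
    consume this sequence; here it is only assembled.
    """
    prompt = []
    for image, click, target in examples:
        prompt.append(draw_click(image, *click))
        prompt.append(target)
    prompt.append(draw_click(query_image, *query_click))
    return prompt


def iou(pred_mask, target_mask):
    """Intersection over union of two binary masks (rows of 0/1 values)."""
    pairs = [(bool(p), bool(t))
             for pred_row, target_row in zip(pred_mask, target_mask)
             for p, t in zip(pred_row, target_row)]
    intersection = sum(p and t for p, t in pairs)
    union = sum(p or t for p, t in pairs)
    return intersection / union if union else 1.0  # two empty masks agree


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two grayscale images."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Painting the interaction into the pixels, rather than passing it through a separate input channel, is what lets the prompt stay a plain sequence of images; whether the paper uses markers of this kind is not stated in the abstract.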
