Fugu-MT Paper Translation (Abstract): INSID3: Training-Free In-Context Segmentation with DINOv3

Paper overview: INSID3: Training-Free In-Context Segmentation with DINOv3

arxiv url: http://arxiv.org/abs/2603.28480v1
Date: Mon, 30 Mar 2026 14:16:37 GMT
Status: Translation complete
System update date: 2026-03-31 23:18:45.439225
Title: INSID3: Training-Free In-Context Segmentation with DINOv3
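Title (reference translation): INSID3: Training-Free In-Context Segmentation with DINOv3
Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
Abstract summary: INSID3 is a training-free approach that segments concepts at varying granularities from frozen DINOv3 features alone. It achieves state-of-the-art results on one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU.
Reference score (independently computed attention score): 63.961377087673476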
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
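Abstract (reference translation): In-context segmentation (ICS) aims to segment arbitrary concepts, such as objects, parts, or personalized instances. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but brings architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: can a single self-supervised backbone support both semantic matching and segmentation? We show that the scaled-up self-supervised features of DINOv3 exhibit strong spatial structure and strong semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities from frozen DINOv3 features alone, given an in-context example. INSID3 achieves state-of-the-art results on one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU while using 3x fewer parameters and no mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.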
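
The abstract's core idea, segmenting a target image by matching frozen dense features against one annotated example, can be illustrated with a generic prototype-matching baseline. This is only a minimal sketch and not the INSID3 pipeline, whose details the abstract does not give. The function names (`segment_by_prototype`, `iou`, `toy_features`), the cosine-similarity threshold of 0.5, and the synthetic arrays standing in for DINOv3 patch features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def segment_by_prototype(ref_feats, ref_mask, tgt_feats, threshold=0.5):
    """Hypothetical baseline: mask target patches similar to a reference prototype.

    ref_feats, tgt_feats: (H, W, C) dense patch features from a frozen backbone.
    ref_mask: (H, W) boolean mask marking the annotated concept in the reference.
    Returns a (H, W) boolean mask for the target.
    """
    if not ref_mask.any():
        raise ValueError("reference mask is empty")
    ref = l2_normalize(ref_feats)
    tgt = l2_normalize(tgt_feats)
    # Average the masked reference features into one prototype vector.
    prototype = l2_normalize(ref[ref_mask].mean(axis=0))
    # Cosine similarity between every target patch and the prototype.
    similarity = tgt @ prototype
    return similarity > threshold


def iou(pred, gt):
    """Intersection over union of two boolean masks (mIoU averages this value)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum()) / float(union)


def toy_features(mask, rng, channels=8, noise=0.01):
    """Synthetic stand-in for backbone features: two clusters plus small noise."""
    fg = np.zeros(channels)
    fg[0] = 1.0
    bg = np.zeros(channels)
    bg[1] = 1.0
    feats = np.where(mask[..., None], fg, bg)
    return feats + noise * rng.normal(size=feats.shape)


rng = np.random.default_rng(0)
ref_mask = np.zeros((6, 6), dtype=bool)
ref_mask[1:4, 1:4] = True
tgt_mask = np.zeros((6, 6), dtype=bool)
tgt_mask[2:6, 0:3] = True

pred = segment_by_prototype(
    toy_features(ref_mask, rng), ref_mask, toy_features(tgt_mask, rng)
)
score = iou(pred, tgt_mask)
```

On real images the features would come from a frozen backbone rather than `toy_features`, and a single global prototype with a fixed threshold is a deliberately simple choice that a full method would likely refine.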
