Fugu-MT Paper Translation (Abstract): Revisiting Model Stitching In the Foundation Model Era

Paper overview: Revisiting Model Stitching In the Foundation Model Era

arxiv url: http://arxiv.org/abs/2603.12433v2
Date: Mon, 16 Mar 2026 06:49:22 GMT
Status: Translation complete
Last updated in system: 2026-03-17 13:51:29.060691
Title: Revisiting Model Stitching In the Foundation Model Era
Title (reference translation): Revisiting Model Stitching in the Foundation Model Era
Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo
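Abstract summary: We revisit stitching for Vision Foundation Models (VFMs) that differ in objectives, data, and modality mix. We introduce a systematic protocol spanning stitch points, stitch layer families, training losses, and downstream tasks. With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks.
Reference score (independently computed attention score): 40.272485094046736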
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
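Abstract (reference translation): Model stitching connects the early layers of one model (the source) to the later layers of another (the target) through a lightweight stitch layer, and has served as a probe of representational compatibility. Prior work found that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that differ in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask whether heterogeneous VFMs are stitchable. We introduce a systematic protocol spanning stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match intermediate features at the stitch point, or that optimize the task loss end-to-end, struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) At deep stitch points, the stitched model can surpass either constituent model with only a small inference overhead (that of the stitch layer). Building on these findings, we propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often use multiple VFMs. Overall, the study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.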
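
Finding (2) in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architecture or loss details, so the linear stitch layer, the single tanh layers standing in for frozen model halves, the squared-error loss, and every name below (`train_stitch`, `src_early`, `tgt_early`, `tgt_late`) are assumptions made for illustration only. The sketch trains only the stitch layer so that the stitched model's features at the target's penultimate layer match those of the unmodified target model.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def tanh_vec(v):
    return [math.tanh(x) for x in v]


def rand_matrix(rng, d):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]


def train_stitch(d=3, n=32, steps=2000, lr=0.01, seed=0):
    """Train a linear stitch layer with a penultimate-layer feature-matching loss.

    All three "model" blocks are hypothetical stand-ins (one tanh layer each)
    and stay frozen; only the stitch matrix is updated.
    """
    rng = random.Random(seed)
    src_early = rand_matrix(rng, d)  # source model: layers before the stitch point
    tgt_early = rand_matrix(rng, d)  # target model: layers before the stitch point
    tgt_late = rand_matrix(rng, d)   # target model: stitch point up to penultimate layer
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

    # Frozen features: source features at the stitch point, and the target's
    # own penultimate-layer features that the stitched model should reproduce.
    src_feats = [tanh_vec(matvec(src_early, x)) for x in xs]
    tgt_penult = [
        tanh_vec(matvec(tgt_late, tanh_vec(matvec(tgt_early, x)))) for x in xs
    ]

    def loss_and_grad(stitch):
        loss = 0.0
        grad = [[0.0] * d for _ in range(d)]
        for h, z in zip(src_feats, tgt_penult):
            u = matvec(stitch, h)                  # stitched features
            z_hat = tanh_vec(matvec(tgt_late, u))  # stitched penultimate features
            diff = [a - b for a, b in zip(z_hat, z)]
            loss += sum(e * e for e in diff) / n
            # Backpropagate through tanh and the frozen target layers.
            delta = [2.0 * e * (1.0 - a * a) / n for e, a in zip(diff, z_hat)]
            grad_u = [
                sum(tgt_late[k][i] * delta[k] for k in range(d)) for i in range(d)
            ]
            for i in range(d):
                for j in range(d):
                    grad[i][j] += grad_u[i] * h[j]
        return loss, grad

    # Start from the identity, i.e. feed source features to the target unchanged.
    stitch = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    initial_loss, _ = loss_and_grad(stitch)
    for _ in range(steps):
        _, grad = loss_and_grad(stitch)
        stitch = [
            [s - lr * g for s, g in zip(s_row, g_row)]
            for s_row, g_row in zip(stitch, grad)
        ]
    final_loss, _ = loss_and_grad(stitch)
    return initial_loss, final_loss, stitch


if __name__ == "__main__":
    before, after, _ = train_stitch()
    print(f"feature-matching loss: {before:.4f} -> {after:.4f}")
```

The matching target here sits at the target's penultimate layer rather than at the stitch point itself, which is the distinction the abstract draws between its approach and conventional stitch-point feature matching. A real setup would replace the toy layers with frozen VFM blocks and an automatic-differentiation framework.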

Related papers