Fugu-MT Paper Translation (Abstract): TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Paper summary: TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

arxiv url: http://arxiv.org/abs/2604.12012v1
Date: Mon, 13 Apr 2026 20:00:04 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.096608
Title: TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
Authors: Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
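Abstract summary: iBOT++ is an upgrade to the commonly used iBOT masked image objective. To improve the efficiency and effectiveness of vision-language pretraining, we modify the exponential moving average setup in the learning recipe. We develop TIPSv2, an image-text encoder model suitable for a wide range of downstream applications.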
Reference score (independently computed attention metric): 43.16091854849133
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .
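Illustrative sketch (not from the paper): the abstract states only that iBOT++ lets unmasked tokens contribute directly to the masked image loss. The NumPy example below shows one way that difference could look, using a teacher-student cross-entropy over patch tokens. The function name patch_loss, the temperature values, the prototype-logit shapes, and the equal weighting of masked and unmasked patches are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def patch_loss(student_logits, teacher_logits, mask, include_unmasked,
               student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student patch distributions.

    student_logits, teacher_logits: arrays of shape (num_patches, num_prototypes).
    mask: bool array of shape (num_patches,), True where the student's input
        patch was masked.
    include_unmasked: False averages over masked patches only (iBOT-style);
        True averages over every patch, which is the idea the abstract
        attributes to iBOT++. Equal weighting is an assumption.
    The temperature defaults are hypothetical values chosen for the sketch.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits / teacher_temp))
    student_logp = log_softmax(student_logits / student_temp)
    per_patch = -(teacher_probs * student_logp).sum(axis=-1)
    if include_unmasked:
        weights = np.ones_like(per_patch)
    else:
        weights = mask.astype(per_patch.dtype)
    # Guard against an all-False mask in the masked-only case.
    return float((per_patch * weights).sum() / max(weights.sum(), 1.0))


# Toy usage: 6 patches, 4 prototypes, 3 of the patches masked.
rng = np.random.default_rng(0)
student = rng.normal(size=(6, 4))
teacher = rng.normal(size=(6, 4))
mask = np.array([True, False, True, False, False, True])

masked_only = patch_loss(student, teacher, mask, include_unmasked=False)
all_patches = patch_loss(student, teacher, mask, include_unmasked=True)
print(masked_only, all_patches)
```

In the masked-only variant, the student's outputs at unmasked positions receive no direct training signal from this term; in the all-patch variant they do, which is the property the abstract credits for improved patch-text alignment.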


Related papers