Fugu-MT Paper Translation (Abstract): Infusing fine-grained visual knowledge to Vision-Language Models

Paper summary: Infusing fine-grained visual knowledge to Vision-Language Models

arxiv url: http://arxiv.org/abs/2508.12137v1
Date: Sat, 16 Aug 2025 19:12:09 GMT
Status: Translation complete
System update date: 2025-08-19 14:49:10.582357
Title: Infusing fine-grained visual knowledge to Vision-Language Models
Title (reference translation): Infusing fine-grained visual knowledge into vision-language models
Authors: Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum
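Abstract summary: Large-scale contrastive pre-training produces Vision-and-Language Models (VLMs). This paper proposes a fine-tuning method that aims for an optimal balance between fine-grained domain adaptation and retention of the VLM's broad multimodal knowledge. Notably, it retains visual-text alignment without using any text data or the original text encoder during fine-tuning.
Reference score (independently computed attention score): 5.487134463783365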
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed at knowledge retention and propose an efficient and effective combination strategy. Additionally, we address the commonly overlooked yet critical aspects of validation set design and hyperparameter tuning to ensure reproducibility and robust generalization across datasets and pretrained models. We extensively evaluate our method on both fine-grained and coarse-grained image-image and image-text retrieval benchmarks. Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning. Code and model checkpoints: https://github.com/nikosips/infusing .
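Abstract (reference translation): Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) that generate representations (embeddings) effective for a wide range of visual and multimodal tasks. However, these pretrained embeddings are still suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder on annotated domain-specific samples. Performing this fine-tuning naively typically causes catastrophic forgetting, which severely degrades the model's general-purpose visual and cross-modal capabilities. This work proposes a fine-tuning method designed to balance fine-grained domain adaptation against retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from the continual learning literature, the authors systematically analyze standard regularization techniques for knowledge retention and propose an efficient and effective strategy for combining them. They also address the often overlooked but critical aspects of validation set design and hyperparameter tuning, to ensure reproducibility and robust generalization across datasets and pretrained models. The method is evaluated extensively on both fine-grained and coarse-grained image-image and image-text retrieval benchmarks. It consistently achieves strong results and, notably, retains visual-text alignment without using any text data or the original text encoder during fine-tuning. Code and model checkpoints: https://github.com/nikosips/infusing .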

Related papers list