Fugu-MT Paper Translation (Abstract): MobileCLIP2: Improving Multi-Modal Reinforced Training

Paper summary: MobileCLIP2: Improving Multi-Modal Reinforced Training

arxiv url: http://arxiv.org/abs/2508.20691v1
Date: Thu, 28 Aug 2025 11:50:22 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.36702
Title: MobileCLIP2: Improving Multi-Modal Reinforced Training
Title (reference translation): MobileCLIP2: Improving Multi-Modal Reinforced Training
Authors: Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari
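Abstract summary: We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracy at low latency. MobileCLIP2-B improves ImageNet-1k accuracy by 2.2% compared with the MobileCLIP-B architecture.
Reference score (independently computed attention score): 65.61629555586948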
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
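Abstract (reference translation): Foundation image-text models such as CLIP, which have zero-shot capabilities, enable a wide range of applications. MobileCLIP is a recent family of image-text models with 3-15 ms latency, 50-150M parameters, and state-of-the-art zero-shot accuracy. The main ingredients of MobileCLIP were its low-latency, lightweight architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption generators and CLIP teachers efficient, scalable, and reproducible. This paper improves the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, and 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. Ablations yield new insights, such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracy at low latency. In particular, ImageNet-1k accuracy for MobileCLIP2-B improves by 2.2% compared with the MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller, and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.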
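The abstract highlights temperature tuning in contrastive knowledge distillation. The following is a minimal sketch of what such a loss can look like, not the paper's implementation: it assumes the loss is a KL divergence between temperature-scaled softmax distributions over teacher and student image-text similarity matrices, averaged over the image-to-text and text-to-image directions. The function names, the toy similarity values, and the temperature values are all hypothetical and chosen only for illustration.

```python
import math


def softmax(row, temperature):
    """Temperature-scaled softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def distillation_loss(teacher_sim, student_sim, teacher_temp, student_temp):
    """Symmetric contrastive distillation loss (illustrative assumption).

    teacher_sim and student_sim are batch-by-batch image-text similarity
    matrices (rows are images, columns are texts).
    """
    def one_direction(t_sim, s_sim):
        per_row = [
            kl_divergence(softmax(t, teacher_temp), softmax(s, student_temp))
            for t, s in zip(t_sim, s_sim)
        ]
        return sum(per_row) / len(per_row)

    image_to_text = one_direction(teacher_sim, student_sim)
    text_to_image = one_direction(transpose(teacher_sim), transpose(student_sim))
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    # Toy cosine similarities for a batch of two image-text pairs (made up).
    teacher = [[0.9, 0.1], [0.2, 0.8]]
    student = [[0.6, 0.4], [0.5, 0.5]]
    for teacher_temp in (1.0, 0.05):
        loss = distillation_loss(teacher, student, teacher_temp, student_temp=0.5)
        print(f"teacher_temp={teacher_temp}: loss={loss:.4f}")
```

In this sketch the teacher temperature controls how sharp the target distribution is: a lower value pushes the teacher's distribution toward one-hot, which changes how strongly the student is penalized for spreading probability across non-matching pairs. That sensitivity is one plausible reason the temperature needs tuning.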

Related papers