Fugu-MT Paper Translation (Summary): ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Paper summary: ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

arxiv url: http://arxiv.org/abs/2510.18795v1
Date: Tue, 21 Oct 2025 16:48:49 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.934207
Title: ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Title (reference translation): ProCLIP: Progressive Vision-Language Alignment via LLM-Based Embedding
Authors: Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang
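Abstract summary: The original CLIP text encoder is limited by a maximum input length of 77 tokens. ProCLIP is a curriculum learning-based progressive vision-language alignment framework.
Reference score (attention score computed by this site): 51.11361080299977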
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The code is available at https://github.com/VisionXLab/ProCLIP
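Abstract (reference translation): The original CLIP text encoder is limited to a maximum input length of 77 tokens, which impairs its ability to process long texts effectively and to perform fine-grained semantic understanding. In addition, the CLIP text encoder does not support multilingual input. These limitations significantly restrict its applicability across a broad range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to improve long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation space of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment through contrastive learning can disrupt the intrinsic vision-language alignment of the CLIP image encoder, leaving the knowledge acquired during pretraining underutilized. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder, leveraging CLIP's rich pretrained knowledge while establishing an initial alignment between the LLM embedder and the CLIP image encoder. ProCLIP then further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, using self-distillation regularization to avoid overfitting. For more effective alignment, an instance semantic alignment loss and an embedding structure alignment loss are used during representation inheritance and contrastive tuning. The code is available at https://github.com/VisionXLab/ProCLIP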
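
The two-stage recipe described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract names the losses but does not give their formulas, so every function name and every loss form here is an assumption. Specifically, the instance semantic alignment loss is assumed to be a mean cosine distance, the embedding structure alignment loss is assumed to be a mean squared gap between in-batch similarity matrices, and the self-distillation regularizer is assumed to pull the tuned image embeddings toward those of a frozen copy of the image encoder. Embeddings are plain Python lists so the sketch runs without dependencies.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(rows, cols):
    """Cosine similarity between every vector in rows and every vector in cols."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[dot(r, c) for c in cols] for r in rows]

def instance_alignment_loss(student, teacher):
    """ASSUMED form: mean (1 - cosine) between paired student/teacher embeddings."""
    total = sum(1.0 - dot(normalize(s), normalize(t)) for s, t in zip(student, teacher))
    return total / len(student)

def structure_alignment_loss(student, teacher):
    """ASSUMED form: mean squared gap between the two in-batch similarity matrices."""
    s_sim = similarity_matrix(student, student)
    t_sim = similarity_matrix(teacher, teacher)
    n = len(student)
    return sum((s_sim[i][j] - t_sim[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE; row i of each input is treated as a matched pair."""
    sims = similarity_matrix(image_emb, text_emb)
    logits = [[s / temperature for s in row] for row in sims]
    n = len(logits)

    def cross_entropy_diag(mat):
        # Cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for i, row in enumerate(mat):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def stage1_loss(llm_text_emb, clip_text_emb):
    """Stage 1 (hypothetical): distill the CLIP text encoder into the LLM embedder."""
    return (instance_alignment_loss(llm_text_emb, clip_text_emb)
            + structure_alignment_loss(llm_text_emb, clip_text_emb))

def stage2_loss(image_emb, llm_text_emb, frozen_image_emb, reg_weight=1.0):
    """Stage 2 (hypothetical): contrastive tuning plus an assumed self-distillation term."""
    reg = instance_alignment_loss(image_emb, frozen_image_emb)
    return contrastive_loss(image_emb, llm_text_emb) + reg_weight * reg
```

In this sketch, `stage1_loss` would be minimized first with the CLIP text encoder acting as a fixed teacher, and `stage2_loss` afterwards on image-text pairs; `reg_weight` and `temperature` are illustrative defaults, not values taken from the paper.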

List of related papers