Fugu-MT Paper Translation (Summary): ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Paper summary: ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

arxiv url: http://arxiv.org/abs/2602.22678v1
Date: Thu, 26 Feb 2026 06:51:25 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.562812
Title: ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
Title (reference translation): ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image Retrieval
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
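Abstract summary: We introduce ViCLIP-OT, a foundation vision-language model designed specifically for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency.
Reference score (independently computed attention score): 0.0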
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
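The CLIP-style contrastive objective mentioned in the abstract is commonly written as a symmetric cross-entropy over a batch similarity matrix. The following is a minimal sketch of that generic objective only; it is not the authors' implementation and does not include the SIGROT loss, whose formulation the abstract does not give. The function names, the temperature of 0.07, and the assumption that embeddings are already L2-normalised are all illustrative choices.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Assumes image_emb[i] and text_emb[i] form a matching pair and that
    every vector is already L2-normalised (so dot product = cosine).
    """
    n = len(image_emb)
    # logits[i][j]: similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_contrastive_loss(images, aligned))  # close to zero
    print(clip_contrastive_loss(images, swapped))  # much larger
```

The loss is small when each image is most similar to its own caption and large when pairs are mismatched, which is the behaviour the contrastive term is meant to reward.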
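Abstract (reference translation): Image-text retrieval has become a fundamental component of intelligent multimedia systems, but existing vision-language models are optimized for high-resource languages and are not well suited to low-resource settings such as Vietnamese. We introduce ViCLIP-OT, a foundation vision-language model designed specifically for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate the modality gap. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) show that ViCLIP-OT consistently outperforms the CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, it achieves an average Recall@K of 67.34%, 5.75 percentage points above CLIP. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and a reduced modality gap. The results suggest that integrating SIGROT offers an effective and scalable strategy for cross-modal retrieval in low-resource languages, with practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.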
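The abstract reports results as an average Recall@K. As a rough illustration of how such a figure is typically computed, the sketch below scores a query-by-candidate similarity matrix. It assumes one ground-truth candidate per query (the candidate at the same index) and averages over K = 1, 5, 10; neither the K values nor the one-to-one pairing is stated in the abstract, and captioning benchmarks often have several captions per image. The function names are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match ranks in the top k.

    similarity[q][c] is the score of query q against candidate c.
    Assumption: the correct candidate for query q is candidate q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in scores if s > scores[q])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def average_recall(similarity, ks=(1, 5, 10)):
    """Mean of Recall@K over the chosen K values (an assumed set)."""
    return sum(recall_at_k(similarity, k) for k in ks) / len(ks)


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],  # query 0: true match ranked first
        [0.8, 0.3, 0.1],  # query 1: true match ranked second
        [0.2, 0.1, 0.7],  # query 2: true match ranked first
    ]
    print(recall_at_k(sim, 1))
    print(recall_at_k(sim, 2))
    print(average_recall(sim, ks=(1, 2)))
```

Retrieval papers usually compute this in both directions (text-to-image and image-to-text) and average the results; running the function on the similarity matrix and on its transpose covers both cases.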

Related papers