Fugu-MT Paper Translation (Abstract): Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

Paper abstract: Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

arxiv url: http://arxiv.org/abs/2509.00177v1
Date: Fri, 29 Aug 2025 18:24:38 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.111621
Title: Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders
Title (reference translation): Improving Category-Level Text-to-Image Retrieval: Bridging the Domain Gap with Diffusion Models and Vision Encoders
Authors: Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias
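Abstract summary: This work explores text-to-image retrieval for queries that specify or describe a semantic category. A generative diffusion model transforms the text query into a visual query. A vision model then estimates image-to-image similarity.
Reference score (independently computed attention score): 41.08205377881149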
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
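Abstract (reference translation): This work explores text-to-image retrieval for queries that specify or describe a semantic category. Vision-and-language models (VLMs) such as CLIP offer a straightforward open-vocabulary solution, but they map text and images to distant regions of the representation space, which limits retrieval performance. To bridge this modality gap, we propose a two-step approach. First, a generative diffusion model transforms the text query into a visual query. Then, a vision model estimates image-to-image similarity. In addition, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advances in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods that rely solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir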
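The two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the embeddings have already been computed (by a VLM for the text query and database, and by a vision encoder for the generated images and database), it uses plain mean pooling as a stand-in for the learned aggregation network, and it fuses the two similarity scores with a fixed, hypothetical weight `alpha`. All function and parameter names are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def aggregate_generated(gen_embs):
    """Pool K generated-image embeddings of shape (K, D) into one unit vector (D,).

    Assumption: mean pooling replaces the paper's learned aggregation network.
    """
    return l2_normalize(l2_normalize(gen_embs).mean(axis=0))

def fused_scores(text_emb, db_vlm_embs, gen_embs, db_vis_embs, alpha=0.5):
    """Combine text-to-image and image-to-image similarity for N database images.

    text_emb:    (Dv,)   VLM embedding of the text query
    db_vlm_embs: (N, Dv) VLM embeddings of the database images
    gen_embs:    (K, D)  vision-encoder embeddings of K generated query images
    db_vis_embs: (N, D)  vision-encoder embeddings of the database images
    alpha:       hypothetical fixed fusion weight (the paper learns the fusion)
    """
    t2i = l2_normalize(db_vlm_embs) @ l2_normalize(text_emb)
    i2i = l2_normalize(db_vis_embs) @ aggregate_generated(gen_embs)
    return alpha * t2i + (1.0 - alpha) * i2i

def rank(scores):
    """Return database indices sorted from most to least similar."""
    return np.argsort(-scores)

# Toy usage with random stand-in embeddings (no real models involved).
rng = np.random.default_rng(0)
scores = fused_scores(
    text_emb=rng.normal(size=8),
    db_vlm_embs=rng.normal(size=(5, 8)),
    gen_embs=rng.normal(size=(4, 16)),
    db_vis_embs=rng.normal(size=(5, 16)),
)
print(rank(scores))
```

Setting `alpha=1.0` reduces the sketch to text-only retrieval, and `alpha=0.0` to retrieval driven purely by the generated visual query, which makes it easy to compare the two modalities on the same database.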

