Fugu-MT Paper Translation (Abstract): STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Paper overview: STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

arxiv url: http://arxiv.org/abs/2605.21261v1
Date: Wed, 20 May 2026 14:51:29 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.736755
Title: STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Title (reference translation): STiTch: Semantic Transition and Movement in Collaboration for Zero-Shot Composed Image Retrieval
Authors: Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo,
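Abstract summary: Training-free zero-shot composed image retrieval models are attracting growing research interest. Recent advances focus on generating the expected target caption. This work introduces a collaborative Semantic Transition and Transportation framework for training-free zero-shot CIR tasks.
Reference score (independently computed attention score): 38.107904166193364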
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training-free zero-shot composed image retrieval (CIR) models have recently attracted growing research interest because of their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploiting the compositional ability of LLMs. Although these methods are efficient, we find that (1) the generated captions tend to introduce unexpected features from the reference image because of the semantic gap between the input image and the text modification, where the image contains far more detail than the text; and (2) the point-to-point alignment used during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel framework in which Semantic Transition and Transportation work in collaboration for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we refine it through a transition vector in the embedding space to bring it closer to the target image. By combining LLMs with the user instruction, the refined caption concentrates on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignments during the retrieval stage, we model the caption and the image as discrete distributions and reformulate retrieval as a set-to-set alignment task. Finally, we develop a bidirectional transportation distance that considers fine-grained alignments across modalities and yields the retrieval score. Extensive experiments demonstrate that our method is general, effective, and beneficial for many CIR tasks.
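The abstract names two mechanisms without giving formulas: a transition vector applied in the embedding space, and a bidirectional transportation distance between a caption and an image modeled as discrete distributions. The NumPy sketch below is one possible reading, not the authors' implementation. The function names, the form of the transition vector (`target_emb - source_emb`), the uniform weights, the softmax transport plans, and the `alpha` and `temperature` parameters are all assumptions introduced for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def apply_transition(caption_emb, source_emb, target_emb, alpha=0.5):
    """Shift a caption embedding along a transition vector.

    ASSUMPTION: the transition vector is simply target_emb - source_emb,
    scaled by a hypothetical step size `alpha`. The abstract does not
    specify how the paper constructs it.
    """
    shifted = caption_emb + alpha * (target_emb - source_emb)
    return l2_normalize(shifted)


def bidirectional_transport_distance(text_set, image_set, temperature=0.1):
    """Set-to-set distance between two bags of embeddings.

    ASSUMPTION: both sets are uniform discrete distributions, and each
    direction uses a softmax transport plan over cosine similarities.
    The two expected transport costs are averaged.
    """
    text_set = l2_normalize(text_set)
    image_set = l2_normalize(image_set)
    sim = text_set @ image_set.T   # (n_text, n_image) cosine similarities
    cost = 1.0 - sim               # cosine distance as transport cost

    # Forward: each text element spreads its mass over image elements.
    plan_t2i = softmax(sim / temperature, axis=1)
    forward = (plan_t2i * cost).sum(axis=1).mean()

    # Backward: each image element spreads its mass over text elements.
    plan_i2t = softmax(sim / temperature, axis=0)
    backward = (plan_i2t * cost).sum(axis=0).mean()

    return 0.5 * (forward + backward)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    caption_tokens = rng.normal(size=(5, 32))           # stand-in caption features
    candidates = [rng.normal(size=(7, 32)) for _ in range(3)]  # stand-in image patches
    # Lower distance means a better match, so rank candidates in ascending order.
    distances = [bidirectional_transport_distance(caption_tokens, c) for c in candidates]
    print("ranking (best first):", np.argsort(distances))
```

Compared with a single cosine similarity between two pooled vectors, scoring over sets lets each caption element match a different image region, which is the kind of fine-grained alignment the abstract describes. Random features stand in for real encoder outputs here.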


Related papers