Fugu-MT Paper Translation (Abstract): Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Paper overview: Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

arxiv url: http://arxiv.org/abs/2606.07032v1
Date: Fri, 05 Jun 2026 08:23:25 GMT
Status: Translation complete
System update date: 2026-06-08 14:33:29.638057
Title: Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
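Title (reference translation): Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
Authors: Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu
Abstract summary: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image from a query consisting of a reference image and a relative caption, without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images because of noisy image sources. This paper introduces ZeroSight, a new benchmark for ZS-CIR.
Reference score (attention score computed independently by this site): 61.420656457977195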
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.
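Abstract (reference translation): Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image from a query consisting of a reference image and a relative caption, without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images caused by noisy image sources, and they cannot achieve a true zero-shot scenario because they use public image datasets on which models such as CLIP have been trained. To address these challenges, we introduce ZeroSight, a new benchmark for ZS-CIR. It includes a dataset of consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that take into account the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions with LLM-assisted methods. To guarantee a true zero-shot scenario, we use video data published after March 31, 2022, so that it is not included in CLIP's pre-training data. In addition, we propose SC4CIR (Symmetric Consistency for CIR), a training-free, MLLM-driven method that effectively identifies hard negative targets through three symmetric consistency checks. The method is plug-and-play, integrates seamlessly with various CIR methods, and significantly improves performance. Experimental results from 27 methods show that current ZS-CIR datasets and evaluation metrics yield inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models are available at https://github.com/sotayang/ZeroSight.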
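The date-based zero-shot guarantee described in the abstract can be illustrated with a minimal sketch. Only the cutoff date (March 31, 2022) comes from the abstract; the record layout, the `published` field, and the function name `published_after_cutoff` are hypothetical and are not taken from the ZeroSight pipeline.

```python
from datetime import date

# Cutoff stated in the abstract; all names below are illustrative assumptions.
CLIP_CUTOFF = date(2022, 3, 31)

def published_after_cutoff(videos, cutoff=CLIP_CUTOFF):
    """Keep only videos whose publication date is strictly after the cutoff."""
    return [v for v in videos if v["published"] > cutoff]

# Hypothetical video records, not real benchmark data.
videos = [
    {"id": "a", "published": date(2021, 12, 1)},
    {"id": "b", "published": date(2022, 3, 31)},
    {"id": "c", "published": date(2023, 7, 15)},
]

kept = published_after_cutoff(videos)
print([v["id"] for v in kept])  # prints ['c']
```

The comparison is strict, so a video published on the cutoff date itself is excluded, which matches the wording "published after March 31, 2022".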

Related papers