PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
- URL: http://arxiv.org/abs/2603.01493v1
- Date: Mon, 02 Mar 2026 06:02:40 GMT
- Authors: Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Jiachen Zhu, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Weinan Zhang, Jun Wang, Jianghao Lin
- Abstract summary: PhotoBench is the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate tools poorly. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. PhotoBench is publicly available.
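The abstract describes retrieval that combines visual semantics with non-visual constraints (time, place, people). A minimal sketch of that idea, assuming a simple filter-then-rank design: hard metadata constraints prune the candidate pool first, and only the survivors are ranked by embedding similarity. All class names, fields, and the `retrieve` function here are hypothetical illustrations, not the PhotoBench implementation.

```python
# Hypothetical sketch: multi-source intent-driven retrieval as
# constraint filtering (metadata) followed by visual ranking (embeddings).
from dataclasses import dataclass
from datetime import date

@dataclass
class Photo:
    photo_id: str
    embedding: list        # visual semantics (e.g., CLIP-style vector)
    taken_on: date         # temporal metadata
    place: str             # spatial metadata
    people: frozenset      # social identity tags

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(photos, query_emb, *, after=None, place=None, with_person=None, k=5):
    """Apply hard non-visual constraints, then rank survivors visually."""
    pool = [p for p in photos
            if (after is None or p.taken_on >= after)
            and (place is None or p.place == place)
            and (with_person is None or with_person in p.people)]
    return sorted(pool, key=lambda p: cosine(p.embedding, query_emb),
                  reverse=True)[:k]
```

This toy design makes the paper's "modality gap" concrete: a purely visual embedding model would score all candidates by `cosine` alone and could not enforce the `after`/`place`/`with_person` constraints at all.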
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval [6.804414686833417]
PRISm is a multimodal framework that advances image-to-image retrieval through two novel components. The Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image. The Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings.
arXiv Detail & Related papers (2025-12-20T15:57:46Z) - The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment [105.31858867473845]
ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing. In experiments, ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
arXiv Detail & Related papers (2025-11-25T18:40:25Z) - Open Multimodal Retrieval-Augmented Factual Image Generation [86.34546873830152]
We introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG). ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines.
arXiv Detail & Related papers (2025-10-26T04:13:31Z) - FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus [10.615833390806486]
Multi-subject personalized image generation aims to synthesize customized images containing multiple specified subjects without requiring test-time optimization. We present FocusDPO, a framework that adaptively identifies focus regions based on dynamic semantic correspondence and supervision image complexity.
arXiv Detail & Related papers (2025-09-01T07:06:36Z) - TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [96.72318842152148]
We propose a unified framework for text-to-image generation and retrieval with one single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. We then propose an autonomous decision mechanism to choose the best-matched one between generated and retrieved images as the response to the text prompt.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available.
We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models [19.519789922033034]
PhotoVerse is an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains.
After a single training phase, our approach enables generating high-quality images within only a few seconds.
arXiv Detail & Related papers (2023-09-11T19:59:43Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)