Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations
- URL: http://arxiv.org/abs/2508.18132v1
- Date: Mon, 25 Aug 2025 15:38:56 GMT
- Title: Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations
- Authors: Hung-Chun Hsu, Yuan-Ching Kuo, Chao-Han Huck Yang, Szu-Wei Fu, Hanrong Ye, Hongxu Yin, Yu-Chiang Frank Wang, Ming-Feng Tsai, Chuan-Ju Wang,
- Abstract summary: E-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions.<n>We propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval.<n>Our approach builds on a generative retriever, further augmented with a test-time reranking mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue.
- Score: 70.94563079082751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of e-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. Recent advances in multimodal generative retrieval -- particularly those leveraging multimodal large language models (MLLMs) as retrievers -- have shown promise. However, most existing methods are tailored to single-turn scenarios and struggle to model the evolving intent and iterative nature of multi-turn dialogues when applied naively. Concurrently, test-time scaling has emerged as a powerful paradigm for improving large language model (LLM) performance through iterative inference-time refinement. Yet, its effectiveness typically relies on two conditions: (1) a well-defined problem space (e.g., mathematical reasoning), and (2) the model's ability to self-correct -- conditions that are rarely met in conversational product search. In this setting, user queries are often ambiguous and evolving, and MLLMs alone have difficulty grounding responses in a fixed product corpus. Motivated by these challenges, we propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking (TTR) mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue. Experiments across multiple benchmarks show consistent improvements, with average gains of 14.5 points in MRR and 10.6 points in nDCG@1.
Related papers
- Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval [27.493644447594367]
MCMR (Multi-Conditional Multimodal Retrieval) is a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries.<n>It spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture.<n>We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities.
arXiv Detail & Related papers (2026-03-01T12:53:47Z) - UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs.<n>We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams [53.78257200138774]
We propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules.<n>We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily.
arXiv Detail & Related papers (2026-01-14T14:31:16Z) - MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction [52.89860691282002]
Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce.<n>Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data.<n>We introduce textscmodelname, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences.
arXiv Detail & Related papers (2025-10-07T06:27:42Z) - F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model [18.21432204057241]
Fine-grained Fragment Retrieval (FFR) is a task requiring models to locate query-relevant fragments.<n>F2RVLM is a generative retrieval model trained in a two-stage paradigm.<n>It outperforms popular Vision-Language Models in both in-domain and real-domain settings.
arXiv Detail & Related papers (2025-08-25T06:42:47Z) - LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues [55.332261548836485]
LoopServe is an adaptive dual-phase inference framework for large language models in multi-turn dialogues.<n>It consistently achieves superior effectiveness compared to existing baselines.<n>It significantly accelerates inference across a wide range of long-context dialogue tasks.
arXiv Detail & Related papers (2025-07-18T06:12:08Z) - Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing [43.75154489681047]
We propose a novel framework leveraging test-time scaling for Multi-Document Summarization (MDS)<n>Our approach employs prompt ensemble techniques to generate multiple candidate summaries using various prompts, then combines them with an aggregator to produce a refined summary.<n>To evaluate our method effectively, we also introduce two new LLM-based metrics: the Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (LLM-ACU) score.
arXiv Detail & Related papers (2025-02-27T23:34:47Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models.<n>The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - Re-ranking the Context for Multimodal Retrieval Augmented Generation [28.63893944806149]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate a response within a context.<n>RAG systems face unique challenges: (i) the retrieval process may select irrelevant entries to user query (e.g., images, documents), and (ii) vision-language models or multi-modal language models like GPT-4o may hallucinate when processing these entries to generate RAG output.<n>We show that by using a more advanced relevancy measure, one can enhance the retrieval process by selecting more relevant pieces from the knowledge-base and eliminate
arXiv Detail & Related papers (2025-01-08T18:58:22Z) - CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances, which are semantically related to the query through the interaction of different modal data.<n>Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.<n>We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.<n>It is composed of drawing prompt alignment, careful training data curation, and error correction.<n>Our experiments on DialogGen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - Enhancing Multi-modal and Multi-hop Question Answering via Structured
Knowledge and Unified Retrieval-Generation [33.56304858796142]
Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidences separately and then use a language model to generate an answer based on the retrieved evidences.
We propose a Structured Knowledge and Unified Retrieval-Generation (RG) approach to address these issues.
arXiv Detail & Related papers (2022-12-16T18:12:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.