Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search
- URL: http://arxiv.org/abs/2602.08700v1
- Date: Mon, 09 Feb 2026 14:16:11 GMT
- Title: Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search
- Authors: Clemencia Siro, Zahra Abbasiantaeb, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke
- Abstract summary: We conduct a user study with 73 participants to investigate the role of images in conversational search. We compare the effect of multimodal and text-only clarifying questions in two search-related tasks within a conversational search context.
- Score: 59.907919633904775
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.
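To make the retrieval comparison concrete, here is a minimal sketch (not the authors' code) of scoring reformulated queries from the two conditions with nDCG@10; the relevance judgments and data layout are invented for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical per-query relevance judgments for documents retrieved by
# reformulated queries under each condition (values are illustrative only).
conditions = {
    "text_only":  [[3, 2, 0, 1], [0, 1, 2, 0]],
    "multimodal": [[3, 3, 2, 1], [2, 1, 0, 1]],
}
for name, runs in conditions.items():
    mean_ndcg = sum(ndcg(r) for r in runs) / len(runs)
    print(f"{name}: mean nDCG@10 = {mean_ndcg:.3f}")
```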
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality.
Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms.
Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
arXiv Detail & Related papers (2026-02-24T18:20:57Z)
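The query-completion idea above can be sketched in a few lines. This is an illustrative Python stub, not the paper's implementation; `generate` and the prompt are assumptions standing in for whatever language model the authors use.

```python
# A generative LM as a query completion function: an underspecified query is
# expanded into a descriptive form before retrieval.
PROMPT = (
    "Expand this short image-search query into a detailed description, "
    "including the desired image quality.\nQuery: {query}\nDescription:"
)

def generate(prompt: str) -> str:
    """Placeholder LM call; swap in a real text-generation backend."""
    # For illustration only: echo a canned expansion of the query.
    query = prompt.rsplit("Query: ", 1)[-1].split("\n", 1)[0]
    return f"a sharp, well-lit, high-resolution photo of {query}"

def complete_query(short_query: str) -> str:
    """Rewrite an underspecified query into a descriptive retrieval query."""
    return generate(PROMPT.format(query=short_query))

print(complete_query("red sneakers"))  # descriptive query for the retriever
```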
- Exploring Rewriting Approaches for Different Conversational Tasks [63.56404271441824]
The exact rewriting approach may often depend on the use case and application-specific tasks supported by the conversational assistant.
We systematically investigate two different approaches, denoted as rewriting and fusion, on two fundamentally different generation tasks.
Our results indicate that the specific rewriting or fusion approach highly depends on the underlying use case and generative task.
arXiv Detail & Related papers (2025-02-26T06:05:29Z)
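The rewriting-versus-fusion contrast above lends itself to a short sketch. Both helpers below are hypothetical; `lm` stands in for any text-generation callable and is not the paper's API.

```python
from typing import Callable, List

# "Rewriting" collapses the history into a single self-contained query;
# "fusion" passes the history and current turn to the model together.

def rewrite_then_answer(lm: Callable[[str], str],
                        history: List[str], turn: str) -> str:
    rewritten = lm("Rewrite as a standalone question:\n"
                   + "\n".join(history + [turn]))
    return lm(rewritten)

def fuse_and_answer(lm: Callable[[str], str],
                    history: List[str], turn: str) -> str:
    return lm("Conversation so far:\n" + "\n".join(history)
              + "\nCurrent question: " + turn)
```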
- Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation.
Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models.
We achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset.
arXiv Detail & Related papers (2025-02-17T12:40:35Z)
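The retrieval-augmented generation setup described above follows a retrieve-then-generate loop. A minimal sketch, assuming hypothetical `retriever` and `generator` callables rather than the paper's actual models:

```python
# Retrieve knowledge snippets for a question about a video, then answer
# conditioned on them. Both callables are illustrative stand-ins.
def answer_video_question(question, video_id, retriever, generator, k=5):
    snippets = retriever(question, video_id, top_k=k)  # e.g., subtitles, summaries
    context = "\n".join(snippets)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generator(prompt)
```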
- Can Users Detect Biases or Factual Errors in Generated Responses in Conversational Information-Seeking? [13.790574266700006]
We investigate the limitations of response generation in conversational information-seeking systems.
The study addresses the problem of query answerability and the challenge of response incompleteness.
Our analysis reveals that it is easier for users to detect response incompleteness than query answerability.
arXiv Detail & Related papers (2024-10-28T20:55:00Z)
- Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search [89.1772985740272]
In mixed-initiative conversational search systems, clarifying questions are used to help users who struggle to express their intentions in a single query.
We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information.
We collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images.
Several analyses are conducted to understand the importance of multimodal content during the query clarification phase.
arXiv Detail & Related papers (2024-02-12T16:04:01Z)
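A record in a multimodal clarifying-question dataset like Melon plausibly pairs a query and question text with attached images. The field names below are guesses for illustration, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MultimodalClarifyingQuestion:
    query: str                      # the ambiguous user query
    question: str                   # the clarifying question text
    image_paths: List[str] = field(default_factory=list)  # attached images
    answer: str = ""                # the user's answer, once collected

example = MultimodalClarifyingQuestion(
    query="jaguar",
    question="Are you looking for the animal or the car? (see images)",
    image_paths=["images/jaguar_animal.jpg", "images/jaguar_car.jpg"],
)
```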
- Estimating the Usefulness of Clarifying Questions and Answers for Conversational Search [17.0363715044341]
We propose a method for processing answers to clarifying questions, moving away from previous work that simply appends answers to the original query.
Specifically, we propose a classifier for assessing the usefulness of the prompted clarifying question and the answer given by the user.
Results demonstrate significant improvements over strong non-mixed-initiative baselines.
arXiv Detail & Related papers (2024-01-21T11:04:30Z)
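The usefulness classifier described above can be sketched as a text encoder with a binary head over the concatenated query, clarifying question, and answer. Encoder choice, separator token, and dimensions are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class UsefulnessClassifier(nn.Module):
    """Predict whether a (query, clarifying question, answer) exchange is useful."""

    def __init__(self, encoder, hidden_dim=768):
        super().__init__()
        self.encoder = encoder                 # any text encoder: List[str] -> [B, hidden_dim]
        self.head = nn.Linear(hidden_dim, 2)   # useful vs. not useful

    def forward(self, queries, questions, answers):
        texts = [f"{q} [SEP] {cq} [SEP] {a}"
                 for q, cq, a in zip(queries, questions, answers)]
        return self.head(self.encoder(texts))
```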
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
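End-to-end retrieval with a multimodal query, as in ReViz, reduces at inference time to embedding the (text, image) query and ranking corpus entries by similarity. The encoders below are hypothetical stand-ins; only the scoring logic is sketched.

```python
import numpy as np

def retrieve(query_text, query_image, corpus, encode_query, encode_doc, top_k=5):
    """Rank corpus entries by cosine similarity to a joint (text, image) query."""
    q = encode_query(query_text, query_image)          # -> np.ndarray [d]
    docs = np.stack([encode_doc(d) for d in corpus])   # -> np.ndarray [N, d]
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in best]
```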
- From A Glance to "Gotcha": Interactive Facial Image Retrieval with Progressive Relevance Feedback [72.29919762941029]
We propose an end-to-end framework to retrieve facial images with relevance feedback progressively provided by the witness.
Without requiring any extra annotations, our model can be applied at the cost of only a little response effort from the witness.
arXiv Detail & Related papers (2020-07-30T18:46:25Z)
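The progressive relevance feedback loop above can be illustrated with a Rocchio-style update: after each round, the query representation is nudged toward the candidates the witness marks as closer. The paper learns this end-to-end, so the sketch below is only an approximation of the idea.

```python
import numpy as np

def feedback_loop(query_vec, gallery, ask_witness, rounds=3, alpha=0.5, top_k=8):
    """Iteratively refine a query vector over a gallery [N, d] of face embeddings."""
    q = np.asarray(query_vec, dtype=float)
    for _ in range(rounds):
        scores = gallery @ q
        shown = np.argsort(-scores)[:top_k]   # candidates shown this round
        liked = ask_witness(shown)            # indices the witness marks as closer
        if liked:
            # Nudge the query toward the mean of the liked candidates.
            q = (1 - alpha) * q + alpha * gallery[liked].mean(axis=0)
    return np.argsort(-(gallery @ q))[:top_k]
```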
- Guided Transformer: Leveraging Multiple External Sources for Representation Learning in Conversational Search [36.64582291809485]
Asking clarifying questions in response to ambiguous or faceted queries has been recognized as a useful technique for various information retrieval systems.
In this paper, we enrich the representations learned by Transformer networks using a novel attention mechanism from external information sources.
Our experiments use a public dataset for search clarification and demonstrate significant improvements compared to competitive baselines.
arXiv Detail & Related papers (2020-06-13T03:24:53Z)
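The external-source attention idea above amounts to letting the query representation cross-attend to encoded external documents. A minimal PyTorch sketch, with dimensions and the residual fusion step assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ExternalSourceAttention(nn.Module):
    """Enrich query representations by attending over external-source encodings."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_states, external_states):
        # query_states: [B, Lq, dim]; external_states: [B, Le, dim]
        attended, _ = self.cross_attn(query_states, external_states, external_states)
        return self.norm(query_states + attended)  # residual fusion
```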
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.