Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
- URL: http://arxiv.org/abs/2507.04884v1
- Date: Mon, 07 Jul 2025 11:16:44 GMT
- Title: Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
- Authors: Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos
- Abstract summary: We propose a pipeline to automatically produce realistic OR-CONVQA dialogs with annotations. We generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system's response.
- Score: 49.413959071830945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system's response.
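The inference flow described in the abstract (rewrite the user question so it is self-contained, retrieve with a dialog-unaware retriever, then generate a grounded response) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: every name here (`toy_rewriter`, `make_overlap_retriever`, `template_generator`, `answer_turn`) is hypothetical, the rule-based rewriter stands in for the trained question rewriter, token overlap stands in for a real retriever, and a string template stands in for the LLM call. The example dialog and propositions are invented for illustration.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical component signatures; none of these come from the paper.
Rewriter = Callable[[Sequence[str], str], str]
Retriever = Callable[[str, int], List[str]]
Generator = Callable[[str, Sequence[str]], str]


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used for naive overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_rewriter(history: Sequence[str], question: str) -> str:
    """Toy decontextualizer (placeholder for a trained rewriter model).

    Replaces the pronoun 'it' with the most recent product-like name
    (letters followed by digits) that appears in the dialog history.
    """
    names = re.findall(r"\b[A-Za-z]+\d+\b", " ".join(history))
    if not names:
        return question
    return re.sub(r"\bit\b", "the " + names[-1], question)


def make_overlap_retriever(propositions: Sequence[str]) -> Retriever:
    """Dialog-unaware retriever stand-in: ranks propositions by token overlap."""
    def retrieve(query: str, k: int) -> List[str]:
        q = tokens(query)
        ranked = sorted(propositions, key=lambda p: len(q & tokens(p)), reverse=True)
        return ranked[:k]
    return retrieve


def template_generator(question: str, evidence: Sequence[str]) -> str:
    """Placeholder for the LLM call: just formats the question and evidence."""
    return "Q: " + question + " | grounded in: " + "; ".join(evidence)


def answer_turn(history: Sequence[str], question: str, rewrite: Rewriter,
                retrieve: Retriever, generate: Generator,
                k: int = 1) -> Tuple[str, List[str], str]:
    """Run one turn: decontextualize, retrieve, then generate."""
    standalone = rewrite(history, question)
    evidence = retrieve(standalone, k)
    return standalone, evidence, generate(standalone, evidence)


# Invented example data, for illustration only.
history = ["How do I reset the X200 router?",
           "Hold the reset button for ten seconds."]
propositions = [
    "The X200 router supports WPA3 encryption.",
    "The X200 router is reset by holding the reset button for ten seconds.",
    "The Z10 camera records 4K video.",
]
standalone, evidence, response = answer_turn(
    history, "Does it support WPA3?",
    toy_rewriter, make_overlap_retriever(propositions), template_generator)
print(standalone)
print(evidence)
print(response)
```

Keeping the three stages as separate callables mirrors the point made in the abstract: once the question is self-contained, the retriever does not need to see the dialog history at all, so an existing dialog-unaware retriever can be plugged in unchanged.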
Related papers
- Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources [18.09705075305591]
We propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance.
We produce four ConvQA datasets by utilizing documents from multiple domains as the primary source.
arXiv Detail & Related papers (2023-11-09T06:03:11Z)
- Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations [66.16863141262506]
We present a novel approach that focuses on generating internet search queries guided by social commonsense.
Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation.
arXiv Detail & Related papers (2023-10-22T16:14:56Z)
- Conversational Tree Search: A New Hybrid Dialog Task [21.697256733634124]
We introduce Conversational Tree Search (CTS) as a new task that bridges the gap between FAQ-style information retrieval and task-oriented dialog.
Our results show that the new architecture combines the positive aspects of both the FAQ and dialog system used in the baseline and achieves higher goal completion.
arXiv Detail & Related papers (2023-03-17T19:50:51Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions.
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions [65.60888490988236]
We release a dataset focused on open-domain single- and multi-turn conversations.
We benchmark several state-of-the-art neural baselines.
We propose a pipeline consisting of offline and online steps for evaluating the quality of clarifying questions in various dialogues.
arXiv Detail & Related papers (2021-09-13T09:16:14Z)
- Open-Retrieval Conversational Machine Reading [80.13988353794586]
In conversational machine reading, systems need to interpret natural language rules, answer high-level questions, and ask follow-up clarification questions.
Existing works assume the rule text is provided for each user question, which neglects the essential retrieval step in real scenarios.
In this work, we propose and investigate an open-retrieval setting of conversational machine reading.
arXiv Detail & Related papers (2021-02-17T08:55:01Z)
- Saying No is An Art: Contextualized Fallback Responses for Unanswerable Dialogue Queries [3.593955557310285]
Most dialogue systems rely on hybrid approaches for generating a set of ranked responses.
We design a neural approach which generates responses which are contextually aware with the user query.
Our simple approach makes use of rules over dependency parses and a text-to-text transformer fine-tuned on synthetic data of question-response pairs.
arXiv Detail & Related papers (2020-12-03T12:34:22Z)
- Towards Data Distillation for End-to-end Spoken Conversational Question Answering [65.124088336738]
We propose a new Spoken Conversational Question Answering task (SCQA).
SCQA aims at enabling QA systems to model complex dialogue flows given the speech utterances and text corpora.
Our main objective is to build a QA system to deal with conversational questions both in spoken and text forms.
arXiv Detail & Related papers (2020-10-18T05:53:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.