q2d: Turning Questions into Dialogs to Teach Models How to Search
- URL: http://arxiv.org/abs/2304.14318v2
- Date: Tue, 26 Dec 2023 16:00:48 GMT
- Title: q2d: Turning Questions into Dialogs to Teach Models How to Search
- Authors: Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb
- Abstract summary: We propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions.
Unlike previous approaches, which relied on human-written dialogs with search queries, our method automatically generates query-based grounded dialogs with better control and scale.
- Score: 11.421839177607147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the exciting capabilities of recent language models for dialog is
their ability to independently search for relevant information to ground a
given dialog response. However, obtaining training data to teach models how to
issue search queries is time- and resource-consuming. In this work, we propose
q2d: an automatic data generation pipeline that generates information-seeking
dialogs from questions. We prompt a large language model (PaLM) to create
conversational versions of question answering datasets, and use it to improve
query generation models that communicate with external search APIs to ground
dialog responses. Unlike previous approaches, which relied on human-written
dialogs with search queries, our method automatically generates
query-based grounded dialogs with better control and scale. Our experiments
demonstrate that: (1) for query generation on the QReCC dataset, models trained
on our synthetically generated data achieve 90%-97% of the performance of
models trained on the human-generated data; (2) we can successfully generate
data for training dialog models in new domains without any existing dialog data,
as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets; and (3) we
perform a thorough analysis of the generated dialogs showing that humans find
them of high quality and struggle to distinguish them from human-written
dialogs.
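The question-to-dialog step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the instruction text, the `Query:` output convention, and the names `QAPair`, `build_prompt`, `parse_generation`, `make_example`, and `stub_llm` are all assumptions made for illustration, and a canned stub stands in for the language model (PaLM in the paper).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QAPair:
    question: str
    answer: str


# Hypothetical instruction; the prompts actually used in the paper are not shown here.
INSTRUCTION = (
    "Rewrite the question as a short dialog between User and Assistant. "
    "End with a line 'Query: <search query>' that would retrieve the answer."
)


def build_prompt(pair: QAPair) -> str:
    """Format one QA pair into a prompt for the dialog-generating model."""
    return f"{INSTRUCTION}\n\nQuestion: {pair.question}\nAnswer: {pair.answer}\nDialog:\n"


def parse_generation(raw: str) -> tuple[list[str], str]:
    """Split raw model output into dialog turns and the final search query."""
    turns, query = [], ""
    for line in raw.strip().splitlines():
        line = line.strip()
        if line.startswith("Query:"):
            query = line[len("Query:"):].strip()
        elif line.startswith(("User:", "Assistant:")):
            turns.append(line)
    return turns, query


def make_example(pair: QAPair, llm: Callable[[str], str]) -> Optional[dict]:
    """Build one (dialog, query) training example, or None if parsing fails."""
    turns, query = parse_generation(llm(build_prompt(pair)))
    if not turns or not query:
        return None  # drop malformed generations (an assumed filtering rule)
    return {"dialog": turns, "target_query": query, "answer": pair.answer}


def stub_llm(prompt: str) -> str:
    """Canned output standing in for a real language model call."""
    return (
        "User: I've been reading about the Eiffel Tower.\n"
        "Assistant: Nice, what would you like to know?\n"
        "User: When was it finished?\n"
        "Query: Eiffel Tower completion year\n"
    )


pair = QAPair("When was the Eiffel Tower completed?", "1889")
example = make_example(pair, stub_llm)
print(example)
```

Examples built this way pair a dialog context with a target search query, which is the kind of supervision a query generation model would need; swapping `stub_llm` for a real model call is left to the reader.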
Related papers
- Multi-Document Grounded Multi-Turn Synthetic Dialog Generation [22.7158929225259]
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas.
We control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought prompting.
We support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user-turn in the dialog.
arXiv Detail & Related papers (2024-09-17T19:02:39Z)
- Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources [18.09705075305591]
We propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance.
We produce four ConvQA datasets by utilizing documents from multiple domains as the primary source.
arXiv Detail & Related papers (2023-11-09T06:03:11Z)
- DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
- Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
Dialogic is a dialogue simulation method based on large language model in-context learning.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
arXiv Detail & Related papers (2022-10-09T06:32:58Z)
- Manual-Guided Dialogue for Flexible Conversational Agents [84.46598430403886]
How to build and use dialogue data efficiently, and how to deploy models in different domains at scale can be critical issues in building a task-oriented dialogue system.
We propose a novel manual-guided dialogue scheme, where the agent learns the tasks from both dialogue and manuals.
Our proposed scheme reduces the dependence of dialogue models on fine-grained domain ontology, and makes them more flexible to adapt to various domains.
arXiv Detail & Related papers (2022-08-16T08:21:12Z)
- DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation [80.45816053153722]
DialogVED introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses.
We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation.
arXiv Detail & Related papers (2022-04-27T16:18:15Z)
- A Model-Agnostic Data Manipulation Method for Persona-based Dialogue Generation [107.82729587882397]
It is expensive to scale up current persona-based dialogue datasets.
Each data sample in this task is more complex to learn with than conventional dialogue data.
We propose a data manipulation method that is model-agnostic and can be combined with any persona-based dialogue generation model.
arXiv Detail & Related papers (2022-04-21T03:49:54Z)
- Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.