Self-Directed Synthetic Dialogues and Revisions Technical Report
- URL: http://arxiv.org/abs/2407.18421v1
- Date: Thu, 25 Jul 2024 22:42:36 GMT
- Title: Self-Directed Synthetic Dialogues and Revisions Technical Report
- Authors: Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, Louis Castricato,
- Abstract summary: We introduce Self Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves.
SDSD consists of multi-turn conversations generated with DBRX, Llama 2 70B, and Mistral Large, all instructed to follow a conversation plan generated prior to the conversation.
- Score: 16.587350874099638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data has become an important tool in the fine-tuning of language models to follow instructions and solve complex problems. Nevertheless, the majority of open data to date is often lacking multi-turn data and collected on closed models, limiting progress on advancing open fine-tuning methods. We introduce Self Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves. The dataset consists of multi-turn conversations generated with DBRX, Llama 2 70B, and Mistral Large, all instructed to follow a conversation plan generated prior to the conversation. We also explore including principles from Constitutional AI and other related works to create synthetic preference data via revisions to the final conversation turn. We hope this work encourages further exploration in multi-turn data and the use of open models for expanding the impact of synthetic data.
Related papers
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis [80.34000499166648]
We propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues.
We apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow.
Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
arXiv Detail & Related papers (2024-10-24T05:45:04Z) - DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications [18.378069426713]
Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems.
We introduce Dia Synth - a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues.
We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum.
arXiv Detail & Related papers (2024-09-25T07:03:31Z) - Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z) - Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis [21.210982054134686]
Methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field.
Existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material.
arXiv Detail & Related papers (2024-04-30T15:22:19Z) - Does Collaborative Human-LM Dialogue Generation Help Information
Extraction from Human Dialogues? [55.28340832822234]
Problem-solving human dialogues in real applications can be much more complex than existing Wizard-of-Oz collections.
We introduce a human-in-the-loop dialogue generation framework capable of synthesizing realistic dialogues.
arXiv Detail & Related papers (2023-07-13T20:02:50Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented
Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z) - A Model-Agnostic Data Manipulation Method for Persona-based Dialogue
Generation [107.82729587882397]
It is expensive to scale up current persona-based dialogue datasets.
Each data sample in this task is more complex to learn with than conventional dialogue data.
We propose a data manipulation method, which is model-agnostic to be packed with any persona-based dialogue generation model.
arXiv Detail & Related papers (2022-04-21T03:49:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.