Related papers: DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production

DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production

URL: http://arxiv.org/abs/2412.08069v1
Date: Wed, 11 Dec 2024 03:31:36 GMT
Title: DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production
Authors: Xiaoyun Liang, Jingyi Ren, Jiayi Qi, Chao Peng, Bo Jiang,
Abstract summary: We present DialogAgent, an automated tool for generating synthetic training data that closely mimics real developer interactions.<n>The tool significantly reduces the reliance on manual data generation, increasing efficiency by 4.8 times compared to traditional methods.
Score: 5.030384831047144
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have become increasingly integral to enhancing developer productivity, particularly in code generation, comprehension, and repair tasks. However, fine-tuning these models with high-quality, real-world data is challenging due to privacy concerns and the lack of accessible, labeled datasets. In this paper, we present DialogAgent, an automated tool for generating synthetic training data that closely mimics real developer interactions within Integrated Development Environments (IDEs). DialogAgent enables the production of diverse, high-fidelity query-response pairs by simulating multi-turn dialogues and contextual behaviors observed in real-world programming scenarios. The tool significantly reduces the reliance on manual data generation, increasing efficiency by 4.8 times compared to traditional methods. Our experiments and online deployment demonstrate substantial improvements in model performance for code-related question-answering tasks: the acceptance rate of responses generated by our in-house model is improved by 33%, after training on synthesized data generated by DialogAgent.

Related papers

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay [86.01901238059261]
APIGen-MT is a framework that generates verifiable and diverse multi-turn agent data. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $tau$-bench and BFCL benchmarks.
arXiv Detail & Related papers (2025-04-04T17:13:57Z)
SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection [15.626772502710867]
We propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy.
arXiv Detail & Related papers (2025-03-19T09:32:52Z)
Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.<n>Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback [62.235925602004535]
We introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task. Agent's goal is to improve student performance. We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers.
arXiv Detail & Related papers (2024-10-08T17:20:37Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
A Transformer-Based Approach for Smart Invocation of Automatic Code Completion [14.34818742116731]
We develop a machine learning model that can predict when to invoke a code completion tool. We collect a dataset of 200k developer interactions with our cross-IDE code completion plugin. Our results indicate that our small-scale transformer model significantly outperforms the baseline.
arXiv Detail & Related papers (2024-05-23T16:19:32Z)
Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts [10.829227084902428]
We investigate the feasibility and effectiveness of Large Language Models (LLMs)-based data generation in source-grounded information-seeking dialogs. We create MISeD -- Meeting Information Seeking Dialogs dataset -- a dataset of information-seeking dialogs focused on meeting transcripts. Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.
arXiv Detail & Related papers (2024-05-02T09:35:06Z)
CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models [8.123272461141815]
We introduce the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms.
arXiv Detail & Related papers (2024-04-02T06:07:35Z)
A Model-Agnostic Data Manipulation Method for Persona-based Dialogue Generation [107.82729587882397]
It is expensive to scale up current persona-based dialogue datasets. Each data sample in this task is more complex to learn with than conventional dialogue data. We propose a data manipulation method, which is model-agnostic to be packed with any persona-based dialogue generation model.
arXiv Detail & Related papers (2022-04-21T03:49:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.