DACO: Towards Application-Driven and Comprehensive Data Analysis via
Code Generation
- URL: http://arxiv.org/abs/2403.02528v1
- Date: Mon, 4 Mar 2024 22:47:58 GMT
- Title: DACO: Towards Application-Driven and Comprehensive Data Analysis via
Code Generation
- Authors: Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Mohan
Tang, Kai-Wei Chang, Nanyun Peng, Haoran Huang
- Abstract summary: Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
- Score: 86.4326416303723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data analysis is a crucial analytical process to generate in-depth studies
and conclusive insights to comprehensively answer a given user query for
tabular data. In this work, we aim to propose new resources and benchmarks to
inspire future research on this crucial yet challenging and under-explored
task. However, collecting data analysis annotations curated by experts can be
prohibitively expensive. We propose to automatically generate high-quality
answer annotations leveraging the code-generation capabilities of LLMs with a
multi-turn prompting technique. We construct the DACO dataset, containing (1)
440 databases (of tabular data) collected from real-world scenarios, (2) ~2k
query-answer pairs that can serve as weak supervision for model training, and
(3) a concentrated but high-quality test set with human refined annotations
that serves as our main evaluation benchmark. We train a 6B supervised
fine-tuning (SFT) model on the DACO dataset, and find that the SFT model learns
reasonable data analysis capabilities. To further align the models with human
preference, we use reinforcement learning to encourage generating analysis
perceived by humans as helpful, and design a set of dense rewards to propagate
the sparse human preference reward to intermediate code generation steps. Our
DACO-RL algorithm is evaluated by human annotators to produce more helpful
answers than the SFT model in 57.72% of cases, validating the effectiveness of our
proposed algorithm. Data and code are released at
https://github.com/shirley-wu/daco
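The multi-turn, code-generation style of analysis described in the abstract can be pictured with a minimal sketch. This is not the DACO implementation: the function names (`run_snippet`, `multi_turn_analysis`, `scripted_model`), the `("code" | "answer", text)` turn format, and the toy table are all hypothetical, and a scripted stand-in replaces the LLM. The sketch only assumes that each turn either emits code to execute against the table or emits a final answer.

```python
import contextlib
import io
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (code that was run, output it printed)


def run_snippet(code: str, namespace: dict) -> str:
    """Execute one snippet and return what it printed, or an error message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # illustration only: never exec untrusted code
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def multi_turn_analysis(
    query: str,
    table: list,
    generate: Callable[[str, History], Tuple[str, str]],
    max_turns: int = 5,
) -> Tuple[str, History]:
    """Alternate between code generation and execution until an answer appears."""
    namespace = {"table": table}  # state persists across turns
    history: History = []
    for _ in range(max_turns):
        kind, text = generate(query, history)
        if kind == "answer":
            return text, history
        # Feed the execution result back to the model on the next turn.
        history.append((text, run_snippet(text, namespace)))
    return "", history  # turn budget exhausted without a final answer


def scripted_model(query: str, history: History) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM: one code turn, then an answer."""
    if not history:
        return "code", "print(sum(row['sales'] for row in table))"
    return "answer", f"Total sales are {history[-1][1]}."
```

Calling `multi_turn_analysis("What are total sales?", [{"sales": 3}, {"sales": 4}], scripted_model)` runs one code turn and then returns the answer together with the recorded history. A real system would sandbox execution and replace `scripted_model` with an actual model call.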
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [19.070305201045954]
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation.
We observe that only a subset of the data in constructed datasets plays a decisive role.
We introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA learning strategy for light fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- RLBoost: Boosting Supervised Models using Deep Reinforcement Learning [0.0]
We present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data.
The results of the article show that this model obtains better and more stable results than other state-of-the-art algorithms such as LOO, DataShapley or DVRL.
arXiv Detail & Related papers (2023-05-23T14:38:33Z)
- Critical Evaluation of LOCO dataset with Machine Learning [0.0]
This paper re-evaluates the so-called Logistics Objects in Context (LOCO) dataset.
LOCO is the first dataset for object detection in the field of intralogistics.
arXiv Detail & Related papers (2022-09-27T16:17:01Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets, using a variety of state-of-the-art benchmarks, demonstrates how our approach can achieve significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
- S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization [104.87483578308526]
We propose the model S3-Rec, which stands for Self-Supervised learning for Sequential Recommendation.
For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attribute, item, subsequence, and sequence.
Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-18T11:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.