Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow
- URL: http://arxiv.org/abs/2306.07209v7
- Date: Sat, 05 Oct 2024 22:55:15 GMT
- Title: Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow
- Authors: Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang,
- Abstract summary: Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
- Score: 49.724842920942024
- License:
- Abstract: Industries such as finance, meteorology, and energy generate vast amounts of data daily. Efficiently managing, processing, and displaying this data requires specialized expertise and is often tedious and repetitive. Leveraging large language models (LLMs) to develop an automated workflow presents a highly promising solution. However, LLMs are not adept at handling complex numerical computations and table manipulations and are also constrained by a limited context budget. Based on this, we propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests. The advancements are twofold: First, it is a code-centric agent that receives human requests and generates code as an intermediary to handle massive data, which is quite flexible for large-scale data processing tasks. Second, Data-Copilot involves a data exploration phase in advance, which explores how to design more universal and error-free interfaces for real-time response. Specifically, it actively explores data sources, discovers numerous common requests, and abstracts them into many universal interfaces for daily invocation. When deployed in real-time requests, Data-Copilot only needs to invoke these pre-designed interfaces, transforming raw data into visualized outputs (e.g., charts, tables) that best match the user's intent. Compared to generating code from scratch, invoking these pre-designed and compiler-validated interfaces can significantly reduce errors during real-time requests. Additionally, interface workflows are more efficient and offer greater interpretability than code. We open-sourced Data-Copilot with massive Chinese financial data, such as stocks, funds, and news, demonstrating promising application prospects.
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - Data Interpreter: An LLM Agent For Data Science [43.13678782387546]
Large Language Model (LLM)-based agents have shown effectiveness across many applications.
However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging.
We present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end.
arXiv Detail & Related papers (2024-02-28T19:49:55Z) - Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z) - In-depth Analysis On Parallel Processing Patterns for High-Performance
Dataframes [0.0]
We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
arXiv Detail & Related papers (2023-07-03T23:11:03Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z) - Machine Learning for Temporal Data in Finance: Challenges and
Opportunities [0.0]
Temporal data are ubiquitous in the financial services (FS) industry.
But machine learning efforts often fail to account for the temporal richness of these data.
arXiv Detail & Related papers (2020-09-11T19:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.