Related papers: Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow

URL: http://arxiv.org/abs/2306.07209v7
Date: Sat, 05 Oct 2024 22:55:15 GMT
Title: Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow
Authors: Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Abstract summary: Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
Score: 49.724842920942024
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Industries such as finance, meteorology, and energy generate vast amounts of data daily. Efficiently managing, processing, and displaying this data requires specialized expertise and is often tedious and repetitive. Leveraging large language models (LLMs) to develop an automated workflow presents a highly promising solution. However, LLMs are not adept at handling complex numerical computations and table manipulations and are also constrained by a limited context budget. Based on this, we propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests. The advancements are twofold: First, it is a code-centric agent that receives human requests and generates code as an intermediary to handle massive data, which is quite flexible for large-scale data processing tasks. Second, Data-Copilot involves a data exploration phase in advance, which explores how to design more universal and error-free interfaces for real-time response. Specifically, it actively explores data sources, discovers numerous common requests, and abstracts them into many universal interfaces for daily invocation. When deployed in real-time requests, Data-Copilot only needs to invoke these pre-designed interfaces, transforming raw data into visualized outputs (e.g., charts, tables) that best match the user's intent. Compared to generating code from scratch, invoking these pre-designed and compiler-validated interfaces can significantly reduce errors during real-time requests. Additionally, interface workflows are more efficient and offer greater interpretability than code. We open-sourced Data-Copilot with massive Chinese financial data, such as stocks, funds, and news, demonstrating promising application prospects.

Related papers

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents [85.02904078131682]
We introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets. ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
arXiv Detail & Related papers (2025-10-28T17:53:13Z)
CoDA: Agentic Systems for Collaborative Data Visualization [57.270599188947294]
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection.
arXiv Detail & Related papers (2025-10-03T17:30:16Z)
Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures [27.473508984130728]
Large-scale scientific collaborations generate petabytes of data, with volumes soon expected to reach exabytes. To manage these computational and storage demands, centralized workflow and data management systems are implemented. A significant obstacle in adopting more effective or AI-driven solutions is the absence of a quick and reliable introspective dynamic model.
arXiv Detail & Related papers (2025-06-24T12:42:36Z)
AutoData: A Multi-Agent System for Open Web Data Collection [37.832257245199365]
AutoData is a novel multi-agent system for Automated web Data collection that requires minimal human intervention. Instruct2DS is a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports.
arXiv Detail & Related papers (2025-05-21T04:32:35Z)
DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science [4.1431677219677185]
DatawiseAgent is a notebook-centric agent framework that unifies interactions among user, agent, and the computational environment. It orchestrates four stages: DFS-like planning, incremental execution, self-debugging, and post-filtering. It consistently outperforms or matches state-of-the-art methods across multiple model settings.
arXiv Detail & Related papers (2025-03-10T08:32:33Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
Data Interpreter: An LLM Agent For Data Science [43.13678782387546]
Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. We present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end.
arXiv Detail & Related papers (2024-02-28T19:49:55Z)
Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly. However, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes [0.0]
We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. We evaluate the performance of Cylon on the ORNL Summit supercomputer.
arXiv Detail & Related papers (2023-07-03T23:11:03Z)
Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis. For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
Machine Learning for Temporal Data in Finance: Challenges and Opportunities [0.0]
Temporal data are ubiquitous in the financial services (FS) industry. But machine learning efforts often fail to account for the temporal richness of these data.
arXiv Detail & Related papers (2020-09-11T19:39:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.