A Survey of LLM $\times$ DATA
- URL: http://arxiv.org/abs/2505.18458v3
- Date: Sun, 01 Jun 2025 16:00:34 GMT
- Title: A Survey of LLM $\times$ DATA
- Authors: Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu
- Abstract summary: The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. On the one hand, DATA4LLM feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows. On the other hand, LLMs are emerging as general-purpose engines for data management.
- Score: 71.96808497574658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, and discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data; and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
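To make the data-processing stage above concrete, here is a minimal Python sketch of two of the steps the abstract names, deduplication and quality filtering. The hash-based near-duplicate check, the `min_words` threshold, and the toy corpus are illustrative assumptions, not methods taken from the survey.

```python
# Minimal sketch of two DATA4LLM processing steps named in the abstract:
# near-duplicate removal (approximated here by hashing normalised text)
# and rule-based quality filtering. Thresholds and rules are illustrative.
import hashlib
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalised document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalise(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def quality_filter(docs: list[str], min_words: int = 5) -> list[str]:
    """Drop documents that are too short to be useful for pre-training."""
    return [d for d in docs if len(d.split()) >= min_words]

corpus = [
    "Large language models need high-quality training data.",
    "Large  language models need high-quality training data. ",  # near duplicate
    "too short",
]
print(quality_filter(deduplicate(corpus)))
```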
Related papers
- Exploring LLM Capabilities in Extracting DCAT-Compatible Metadata for Data Cataloging [0.1424853531377145]
Data catalogs can support and accelerate data exploration by using metadata to answer user queries. This study investigates whether LLMs can automate metadata maintenance of text-based data and generate high-quality DCAT-compatible metadata. Our results show that LLMs can generate metadata comparable to human-created content, particularly on tasks that require advanced semantic understanding.
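A hedged sketch of the workflow this paper describes: prompt an LLM for DCAT-compatible metadata and parse the result. The prompt wording, the `ask_llm` stub, and the sample dataset description are assumptions for illustration; the stub returns a canned answer so the script runs without a model.

```python
# Sketch: ask an LLM to emit DCAT-compatible metadata (title, description,
# keywords, license) for a text-based dataset. `ask_llm` is a hypothetical
# stand-in for any chat-completion client.
import json

DCAT_PROMPT = (
    "Read the dataset description below and return JSON with the DCAT fields "
    "dct:title, dct:description, dcat:keyword (list), and dct:license.\n\n{doc}"
)

def ask_llm(prompt: str) -> str:
    # Placeholder response; replace with a real model call.
    return json.dumps({
        "dct:title": "City Air Quality Measurements",
        "dct:description": "Hourly PM2.5 and NO2 readings from municipal sensors.",
        "dcat:keyword": ["air quality", "sensors", "open data"],
        "dct:license": "CC-BY-4.0",
    })

def extract_dcat_metadata(document: str) -> dict:
    """Ask the model for metadata and parse the JSON it returns."""
    return json.loads(ask_llm(DCAT_PROMPT.format(doc=document)))

print(extract_dcat_metadata("Hourly air quality readings collected by municipal sensors."))
```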
arXiv Detail & Related papers (2025-07-04T10:49:37Z) - DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science [4.1431677219677185]
DatawiseAgent is a notebook-centric agent framework that unifies interactions among the user, the agent, and the computational environment. It orchestrates four stages: DFS-like planning, incremental execution, self-debugging, and post-filtering. It consistently outperforms or matches state-of-the-art methods across multiple model settings.
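A toy sketch of the four stages listed above, under simplifying assumptions: `plan`, `fix`, and the notebook cells are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative loop over the four DatawiseAgent stages: DFS-like planning,
# incremental execution, self-debugging, and post-filtering (all toy versions).

def plan(task: str) -> list[str]:
    """Expand the task into executable notebook cells (toy plan; second cell is buggy)."""
    return ["rows = [1, 2, 3]", "print(sum(rows) / len(rows)"]

def fix(cell: str, error: str) -> str:
    """Self-debugging: in the real system an LLM rewrites the failing cell."""
    return "print(sum(rows) / len(rows))"

def run_notebook(task: str, max_retries: int = 2) -> list[str]:
    state: dict = {}
    kept_cells = []
    for cell in plan(task):                      # DFS-like planning
        for _ in range(max_retries + 1):
            try:
                exec(cell, state)                # incremental execution
                kept_cells.append(cell)
                break
            except Exception as err:
                cell = fix(cell, str(err))       # self-debugging
    return [c for c in kept_cells if not c.startswith("#")]  # post-filtering (toy rule)

run_notebook("compute the mean of a small list")
```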
arXiv Detail & Related papers (2025-03-10T08:32:33Z) - Federated Data-Efficient Instruction Tuning for Large Language Models [34.35613476734293]
FedHDS, a federated data-efficient instruction tuning approach for large language models, is presented.
It reduces the redundancy of data samples at both intra-client and inter-client levels.
Experiments show that FedHDS significantly reduces the amount of data required for fine-tuning while improving the responsiveness of the instruction-tuned LLMs to unseen tasks.
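A toy sketch of the redundancy-reduction idea summarised here: drop near-duplicate instruction samples first within each client, then across clients. The word-set fingerprint and the example client data are illustrative assumptions; real systems would use embedding-based clustering.

```python
# Shrink a federated instruction-tuning set by removing redundant samples
# at the intra-client level and then at the inter-client level (toy version).

def fingerprint(sample: str) -> frozenset:
    """Crude stand-in for an embedding: the set of lowercased words."""
    return frozenset(sample.lower().split())

def select_coreset(clients: dict[str, list[str]]) -> dict[str, list[str]]:
    shared: set[frozenset] = set()          # inter-client view
    coreset: dict[str, list[str]] = {}
    for client, samples in clients.items():
        local: set[frozenset] = set()       # intra-client view
        kept = []
        for s in samples:
            fp = fingerprint(s)
            if fp not in local and fp not in shared:
                local.add(fp)
                kept.append(s)
        shared |= local
        coreset[client] = kept
    return coreset

clients = {
    "client_a": ["Summarize the patient note.", "Summarize the patient note."],
    "client_b": ["Summarize the patient note.", "Translate the note to French."],
}
print(select_coreset(clients))
```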
arXiv Detail & Related papers (2024-10-14T15:05:51Z) - 60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets and can be fine-tuned for the question-answering (QA) task. We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs. Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
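A sketch of the data-categorisation step described above, splitting SFT pairs by how much of the answer the pretrained model already knows. The `knowledge_score` heuristic and the example pairs are placeholders, not the paper's actual criterion.

```python
# Split SFT question-answer pairs into "already memorised" vs. "novel" buckets.

def knowledge_score(question: str, answer: str) -> float:
    """Placeholder: fraction of the answer the base model would reproduce on its own."""
    return 0.9 if "capital" in question else 0.2

def categorise(sft_pairs, threshold: float = 0.5):
    known, unknown = [], []
    for q, a in sft_pairs:
        (known if knowledge_score(q, a) >= threshold else unknown).append((q, a))
    return known, unknown

pairs = [
    ("What is the capital of France?", "Paris"),
    ("What did the internal Q3 report conclude?", "Revenue grew 4%."),
]
known, unknown = categorise(pairs)
print(len(known), "memorised-knowledge pairs;", len(unknown), "novel-knowledge pairs")
```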
arXiv Detail & Related papers (2024-09-24T07:38:38Z) - Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely [8.507599833330346]
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks.
Retrieval-Augmented Generation (RAG) and fine-tuning are gaining increasing attention and widespread application.
However, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges.
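A minimal sketch of the basic RAG loop this survey builds on: retrieve relevant external passages, then condition the model on them. The word-overlap retriever and the `answer_with_llm` stub are illustrative assumptions rather than the survey's recommendations.

```python
# Minimal retrieval-augmented generation pipeline: retrieve, build a prompt, answer.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))[:k]

def answer_with_llm(prompt: str) -> str:
    return "LLM answer grounded in the retrieved context."   # placeholder model call

def rag(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return answer_with_llm(prompt)

corpus = [
    "The 2024 maintenance window is every Sunday at 02:00 UTC.",
    "Database backups are stored for 30 days.",
]
print(rag("When is the maintenance window?", corpus))
```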
arXiv Detail & Related papers (2024-09-23T11:20:20Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
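A sketch of an LLM2LLM-style iteration under simplifying assumptions: train a student on the current data, collect the examples it still gets wrong, and have a teacher synthesise similar ones. `train_student`, `wrong_cases`, and `teacher_augment` are hypothetical placeholders.

```python
# Iterative data enhancement loop: student trains, fails, teacher augments.
import random

def train_student(dataset):            # placeholder for fine-tuning a student model
    return lambda q: random.random() > 0.5

def wrong_cases(student, dataset):
    """Examples the current student still answers incorrectly."""
    return [(q, a) for q, a in dataset if not student(q)]

def teacher_augment(example):          # placeholder for the teacher LLM
    q, a = example
    return (q + " (rephrased)", a)

def llm2llm(seed, rounds: int = 3):
    data = list(seed)
    for _ in range(rounds):
        student = train_student(data)
        hard = wrong_cases(student, data)
        data += [teacher_augment(ex) for ex in hard]   # targeted augmentation
    return data

seed = [("2+2?", "4"), ("Capital of Japan?", "Tokyo")]
print(len(llm2llm(seed)), "examples after iterative augmentation")
```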
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
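A toy sketch of the generate-then-curate pattern attributed to CLLM above: an LLM proposes synthetic tabular rows and a curation step keeps only confidently labelled ones. Both the generator and the confidence scorer are placeholders, not the paper's mechanism.

```python
# Generate synthetic tabular rows, then curate them by label confidence.

def llm_generate_rows(n: int) -> list[dict]:
    """Placeholder for LLM-based tabular generation."""
    return [{"age": 30 + i, "income": 40000 + 1000 * i, "label": i % 2} for i in range(n)]

def label_confidence(row: dict) -> float:
    """Placeholder curator: confidence of a reference model in the row's label."""
    return 0.95 if row["age"] < 35 else 0.4

def curated_augmentation(n: int, min_conf: float = 0.8) -> list[dict]:
    return [r for r in llm_generate_rows(n) if label_confidence(r) >= min_conf]

print(curated_augmentation(10))
```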
arXiv Detail & Related papers (2023-12-19T12:34:46Z) - SEED: Domain-Specific Data Curation With Large Language Models [22.54280367957015]
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs).
SEED automatically selects from its four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand.
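A hedged sketch of what a hybrid execution pipeline can look like: route easy records through cheap generated code and fall back to an LLM for the rest. The two modules and the routing rule are illustrative assumptions, not SEED's actual module set.

```python
# Hybrid pipeline: cheap code path first, LLM fallback for hard records.

def code_module(record: str):
    """Stand-in for LLM-generated curation code: handles simple numeric fields."""
    return float(record) if record.replace(".", "", 1).isdigit() else None

def llm_module(record: str) -> float:
    """Stand-in for asking an LLM to curate the hard cases."""
    return 42.0   # placeholder answer

def hybrid_pipeline(records: list[str]) -> list[float]:
    out = []
    for r in records:
        parsed = code_module(r)                                       # cheap path first
        out.append(parsed if parsed is not None else llm_module(r))   # LLM fallback
    return out

print(hybrid_pipeline(["3.14", "about forty-two"]))
```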
arXiv Detail & Related papers (2023-10-01T17:59:20Z) - Data-Juicer: A One-Stop Data Processing System for Large Language Models [73.27731037450995]
A data recipe is a mixture of data from different sources for training Large Language Models (LLMs).
We build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes.
The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs.
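An illustration of what a data recipe amounts to: a declarative mix of sources plus a chain of processing operators. The schema and operator names below are a generic sketch, not Data-Juicer's actual configuration format.

```python
# A generic "data recipe": weighted sources plus an operator chain.

recipe = {
    "sources": [
        {"path": "web_crawl.jsonl", "weight": 0.6},
        {"path": "code_corpus.jsonl", "weight": 0.3},
        {"path": "wiki_dump.jsonl", "weight": 0.1},
    ],
    "operators": [
        {"name": "language_filter", "keep": ["en"]},
        {"name": "length_filter", "min_words": 20},
        {"name": "near_dedup", "threshold": 0.8},
    ],
}

def apply_recipe(recipe: dict) -> None:
    """Walk the recipe and report what a processing system would execute."""
    for src in recipe["sources"]:
        print(f"mix {src['path']} at weight {src['weight']}")
    for op in recipe["operators"]:
        print(f"apply operator: {op['name']}")

apply_recipe(recipe)
```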
arXiv Detail & Related papers (2023-09-05T08:22:07Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
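A sketch of the closed loop described above: evaluate the model, give weaker data types a larger share of the next generation batch (the bad-case sampling idea), and synthesise new data. Evaluation and generation are stubbed with placeholder values.

```python
# Closed-loop refinement: evaluate -> re-weight weak categories -> generate data.

def evaluate(model) -> dict[str, float]:
    """Placeholder per-category accuracy from an evaluation round."""
    return {"counting": 0.42, "ocr": 0.78, "spatial": 0.55}

def sampling_ratios(scores: dict[str, float]) -> dict[str, float]:
    """Give weaker categories a larger share of the next data batch."""
    inv = {k: 1.0 - v for k, v in scores.items()}
    total = sum(inv.values())
    return {k: v / total for k, v in inv.items()}

def generate_batch(ratios: dict[str, float], batch_size: int = 100) -> dict[str, int]:
    """Placeholder for GPT-4-style generation of each data type."""
    return {k: round(r * batch_size) for k, r in ratios.items()}

model = object()                      # stand-in for the model being improved
for round_id in range(2):             # two refinement loops
    ratios = sampling_ratios(evaluate(model))
    print(f"round {round_id}: generate", generate_batch(ratios))
```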
arXiv Detail & Related papers (2023-08-25T01:41:04Z)