LLMs with User-defined Prompts as Generic Data Operators for Reliable
Data Processing
- URL: http://arxiv.org/abs/2312.16351v1
- Date: Tue, 26 Dec 2023 23:08:38 GMT
- Title: LLMs with User-defined Prompts as Generic Data Operators for Reliable
Data Processing
- Authors: Luyi Ma, Nikhil Thakurdesai, Jiao Chen, Jianpeng Xu, Evren Korpeoglu,
Sushant Kumar, Kannan Achan
- Abstract summary: We propose a new design pattern in which large language models (LLMs) work as a generic data operator (LLM-GDO).
In the LLM-GDO design pattern, user-defined prompts (UDPs) represent the data processing logic rather than implementations in a specific programming language.
Fine-tuning LLMs with domain-specific data can enhance performance on domain-specific tasks, making data processing knowledge-aware.
- Score: 13.901862478287509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data processing is one of the fundamental steps in machine learning
pipelines to ensure data quality. The majority of applications adopt the
user-defined function (UDF) design pattern for data processing in databases.
Although the UDF design pattern offers flexibility, reusability, and
scalability, the increasing demands of machine learning pipelines expose three
new challenges in this design pattern -- it is not low-code, not
dependency-free, and not knowledge-aware. To address these challenges, we
propose a new design pattern in which large language models (LLMs) work as a
generic data operator (LLM-GDO) for reliable data cleansing, transformation,
and modeling, leveraging their human-compatible performance. In the LLM-GDO
design pattern, user-defined prompts (UDPs) represent the data processing
logic rather than implementations in a specific programming language. LLMs can
be centrally maintained, so users do not have to manage dependencies at
run-time. Fine-tuning LLMs with domain-specific data can enhance performance
on domain-specific tasks, making data processing knowledge-aware. We
illustrate these advantages with examples of different data processing tasks.
Furthermore, we summarize the challenges and opportunities introduced by LLMs
to provide a complete view of this design pattern for further discussion.
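
To make the UDP-versus-UDF contrast concrete, below is a minimal sketch of
what an LLM-GDO invocation could look like. This is an illustration under our
own assumptions, not the paper's implementation: the names `llm_gdo` and
`stub_llm`, the prompt layout, and the example record are all hypothetical.

```python
from typing import Callable, Mapping

def llm_gdo(udp: str, record: Mapping, llm: Callable[[str], str]) -> str:
    """Apply a user-defined prompt (UDP) to one data record.

    The UDP carries the processing logic as natural language; `llm` is
    whatever centrally maintained model endpoint the pipeline exposes,
    so the caller ships no language-specific UDF code or dependencies.
    """
    # Compose a single prompt from the logic (UDP) and the data (record).
    prompt = f"{udp}\n\nInput record:\n{dict(record)}\n\nOutput:"
    return llm(prompt)

# A cleansing task expressed as a prompt rather than a UDF. `stub_llm`
# stands in for a real model endpoint and returns a plausible completion.
def stub_llm(prompt: str) -> str:
    return "ultrasoft cotton t-shirt"

udp = ("Normalize the product title to lowercase, remove marketing "
       "superlatives, and return only the cleaned title.")
record = {"product_title": "AMAZING!! UltraSoft Cotton T-Shirt (BEST DEAL)"}
print(llm_gdo(udp, record, stub_llm))  # -> ultrasoft cotton t-shirt
```

Because the processing logic lives entirely in the prompt string, changing a
cleansing rule means editing text rather than redeploying UDF code, which is
the low-code, dependency-free property the abstract describes.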
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- ProcessTBench: An LLM Plan Generation Dataset for Process Mining [0.0]
Large Language Models (LLMs) have shown significant promise in plan generation.
Existing datasets often lack the complexity needed for advanced tool use scenarios.
We present the ProcessTBench synthetic dataset, an extension of the TaskBench dataset.
arXiv Detail & Related papers (2024-09-13T20:56:21Z)
- Multi-agent Planning using Visual Language Models [2.2369578015657954]
Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks.
However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required.
We propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input.
arXiv Detail & Related papers (2024-08-10T08:10:17Z)
- Relational Database Augmented Large Language Model [59.38841050766026]
Large language models (LLMs) excel in many natural language processing (NLP) tasks.
However, they can only incorporate new knowledge through training or supervised fine-tuning processes.
This precise, up-to-date, and private information is typically stored in relational databases.
arXiv Detail & Related papers (2024-07-21T06:19:10Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [16.836658183451764]
Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data.
Existing publicly available hardware datasets are often limited in size, complexity, or detail.
We propose a Multi-Grained-Verilog (MG-Verilog) dataset, which encompasses descriptions at various levels of detail and corresponding code samples.
arXiv Detail & Related papers (2024-07-02T03:21:24Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- UniDM: A Unified Framework for Data Manipulation with Large Language Models [66.61466011795798]
Large Language Models (LLMs) can resolve multiple data manipulation tasks.
They show clear performance benefits but still require customized designs to fit each specific task.
We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks.
arXiv Detail & Related papers (2024-05-10T14:44:04Z)
- Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
However, deploying them for downstream applications remains challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.