LLMs with User-defined Prompts as Generic Data Operators for Reliable
Data Processing
- URL: http://arxiv.org/abs/2312.16351v1
- Date: Tue, 26 Dec 2023 23:08:38 GMT
- Title: LLMs with User-defined Prompts as Generic Data Operators for Reliable
Data Processing
- Authors: Luyi Ma, Nikhil Thakurdesai, Jiao Chen, Jianpeng Xu, Evren Korpeoglu,
Sushant Kumar, Kannan Achan
- Abstract summary: We propose a new design pattern in which large language models (LLMs) work as a generic data operator (LLM-GDO).
In the LLM-GDO design pattern, user-defined prompts (UDPs) represent the data processing logic rather than implementations in a specific programming language.
Fine-tuning LLMs with domain-specific data can enhance performance on domain-specific tasks, making data processing knowledge-aware.
- Score: 13.901862478287509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data processing is one of the fundamental steps in machine learning
pipelines to ensure data quality. The majority of applications adopt the
user-defined function (UDF) design pattern for data processing in databases.
Although the UDF design pattern offers flexibility, reusability, and
scalability, the increasing demands of machine learning pipelines expose three
new challenges in this design pattern -- it is not low-code, not
dependency-free, and not knowledge-aware. To address these challenges, we
propose a new design pattern in which large language models (LLMs) work as a
generic data operator (LLM-GDO) for reliable data cleansing, transformation,
and modeling, leveraging their human-compatible performance. In the LLM-GDO
design pattern, user-defined prompts (UDPs) represent the data processing
logic rather than implementations in a specific programming language. LLMs can
be centrally maintained, so users do not have to manage dependencies at
run-time. Fine-tuning LLMs with domain-specific data can enhance performance
on domain-specific tasks, making data processing knowledge-aware. We
illustrate these advantages with examples of different data processing tasks.
Furthermore, we summarize the challenges and opportunities introduced by LLMs
to provide a complete view of this design pattern for further discussion.
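
To make the UDP-versus-UDF contrast concrete, below is a minimal sketch of
what an LLM-GDO invocation could look like. This is an illustration under our
own assumptions, not the paper's implementation: the names `llm_gdo` and
`stub_llm`, the prompt layout, and the example record are all hypothetical.

```python
from typing import Callable, Mapping

def llm_gdo(udp: str, record: Mapping, llm: Callable[[str], str]) -> str:
    """Apply a user-defined prompt (UDP) to one data record.

    The UDP carries the processing logic as natural language; `llm` is
    whatever centrally maintained model endpoint the pipeline exposes,
    so the caller ships no language-specific UDF code or dependencies.
    """
    # Compose a single prompt from the logic (UDP) and the data (record).
    prompt = f"{udp}\n\nInput record:\n{dict(record)}\n\nOutput:"
    return llm(prompt)

# A cleansing task expressed as a prompt rather than a UDF. `stub_llm`
# stands in for a real model endpoint and returns a plausible completion.
def stub_llm(prompt: str) -> str:
    return "ultrasoft cotton t-shirt"

udp = ("Normalize the product title to lowercase, remove marketing "
       "superlatives, and return only the cleaned title.")
record = {"product_title": "AMAZING!! UltraSoft Cotton T-Shirt (BEST DEAL)"}
print(llm_gdo(udp, record, stub_llm))  # -> ultrasoft cotton t-shirt
```

Because the processing logic lives entirely in the prompt string, changing a
cleansing rule means editing text rather than redeploying UDF code, which is
the low-code, dependency-free property the abstract describes.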
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- ProcessTBench: An LLM Plan Generation Dataset for Process Mining [0.0]
Large Language Models (LLMs) have shown significant promise in plan generation.
Existing datasets often lack the complexity needed for advanced tool use scenarios.
We present the ProcessTBench synthetic dataset, an extension of the TaskBench dataset.
arXiv Detail & Related papers (2024-09-13T20:56:21Z)
- Multi-agent Planning using Visual Language Models [2.2369578015657954]
Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks.
However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required.
We propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input.
arXiv Detail & Related papers (2024-08-10T08:10:17Z)
- Relational Database Augmented Large Language Model [59.38841050766026]
Large language models (LLMs) excel in many natural language processing (NLP) tasks.
However, they can only incorporate new knowledge through training or supervised fine-tuning processes.
This precise, up-to-date, and private information is typically stored in relational databases.
arXiv Detail & Related papers (2024-07-21T06:19:10Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [16.836658183451764]
Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data.
Existing publicly available hardware datasets are often limited in size, complexity, or detail.
We propose a Multi-Grained-Verilog (MG-Verilog) dataset, which encompasses descriptions at various levels of detail and corresponding code samples.
arXiv Detail & Related papers (2024-07-02T03:21:24Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- UniDM: A Unified Framework for Data Manipulation with Large Language Models [66.61466011795798]
Large Language Models (LLMs) can resolve multiple data manipulation tasks.
They show clear performance benefits but still require customized designs to fit each specific task.
We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks.
arXiv Detail & Related papers (2024-05-10T14:44:04Z)
- Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
However, deploying them for downstream applications remains challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.