LLMs with User-defined Prompts as Generic Data Operators for Reliable
  Data Processing
- URL: http://arxiv.org/abs/2312.16351v1
- Date: Tue, 26 Dec 2023 23:08:38 GMT
- Title: LLMs with User-defined Prompts as Generic Data Operators for Reliable
  Data Processing
- Authors: Luyi Ma, Nikhil Thakurdesai, Jiao Chen, Jianpeng Xu, Evren Korpeoglu,
  Sushant Kumar, Kannan Achan
- Abstract summary: We propose a new design pattern in which large language models (LLMs) work as a generic data operator (LLM-GDO).
In the LLM-GDO design pattern, user-defined prompts (UDPs) are used to represent the data processing logic rather than implementations in a specific programming language.
Fine-tuning LLMs with domain-specific data could enhance performance on domain-specific tasks, which makes data processing knowledge-aware.
- Score: 13.901862478287509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Data processing is one of the fundamental steps in machine learning pipelines
to ensure data quality. The majority of applications adopt the user-defined
function (UDF) design pattern for data processing in databases. Although the
UDF design pattern introduces flexibility, reusability and scalability, the
increasing demand on machine learning pipelines brings three new challenges to
this design pattern -- not low-code, not dependency-free and not
knowledge-aware. To address these challenges, we propose a new design pattern
in which large language models (LLMs) work as a generic data operator
(LLM-GDO) for reliable data cleansing, transformation and modeling with their
human-compatible performance. In the LLM-GDO design pattern, user-defined
prompts (UDPs) are used to represent the data processing logic rather than
implementations with a specific programming language. LLMs can be centrally
maintained, so users do not have to manage dependencies at run-time.
Fine-tuning LLMs with domain-specific data could enhance the performance on the
domain-specific tasks, which makes data processing knowledge-aware. We
illustrate these advantages with examples in different data processing tasks.
Furthermore, we summarize the challenges and opportunities introduced by LLMs
to provide a complete view of this design pattern for more discussions.
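
The contrast between a UDF and a UDP described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `udf_normalize_phone`, `UDP_NORMALIZE_PHONE`, `llm_gdo` and `stub_llm` are all hypothetical, the phone-number task is an assumed example of data cleansing, and `stub_llm` is an offline stand-in for a centrally maintained model so that the sketch runs without any external service.

```python
from typing import Callable, Iterable, List

# Hypothetical type: anything that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def udf_normalize_phone(raw: str) -> str:
    """Traditional UDF: the processing logic is written in a programming language."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:]


# UDP: the same intent expressed as a natural-language prompt template (assumed wording).
UDP_NORMALIZE_PHONE = (
    "Return only the 10-digit phone number contained in the input, "
    "with no punctuation.\nInput: {value}\nOutput:"
)


def llm_gdo(udp: str, rows: Iterable[str], llm: LLM) -> List[str]:
    """Generic operator: fill the UDP template for each row and call the model."""
    return [llm(udp.format(value=row)).strip() for row in rows]


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a hosted LLM; it parses the prompt and reuses the UDF."""
    value = prompt.split("Input: ", 1)[1].rsplit("\nOutput:", 1)[0]
    return udf_normalize_phone(value)


rows = ["(408) 555-0100", "+1 650.555.0199"]
result = llm_gdo(UDP_NORMALIZE_PHONE, rows, stub_llm)
print(result)
```

In this sketch, changing the task means editing only the prompt text passed to `llm_gdo`. The operator itself stays the same, and the caller needs no task-specific code or libraries.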
 
      
Related papers
- Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB [44.057784044659726]
 Large language models (LLMs) have made it easier to prototype such retrieval and reasoning data pipelines.
This often involves orchestrating data systems, managing data movement, and handling low-level details.
We introduce FlockMTL, an extension that deeply integrates LLM capabilities and retrieval-augmented generation.
 arXiv  Detail & Related papers  (2025-04-01T19:48:17Z)
- LLM-Powered Proactive Data Systems [3.21573589381478]
 Most data systems treat LLMs as an opaque black box that operates on user inputs and data as is.
We argue that data systems need to be given more agency to understand and rework the user inputs and the data.
 arXiv  Detail & Related papers  (2025-02-18T16:34:45Z)
- Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
 Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
 arXiv  Detail & Related papers  (2024-11-21T04:23:17Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
 Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
 arXiv  Detail & Related papers  (2024-10-22T06:43:28Z)
- ProcessTBench: An LLM Plan Generation Dataset for Process Mining [0.0]
 Large Language Models (LLMs) have shown significant promise in plan generation.
Existing datasets often lack the complexity needed for advanced tool use scenarios.
We present the ProcessTBench synthetic dataset, an extension of the TaskBench dataset.
 arXiv  Detail & Related papers  (2024-09-13T20:56:21Z)
- Multi-agent Planning using Visual Language Models [2.2369578015657954]
 Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks.
LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required.
We propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input.
 arXiv  Detail & Related papers  (2024-08-10T08:10:17Z)
- Relational Database Augmented Large Language Model [59.38841050766026]
 Large language models (LLMs) excel in many natural language processing (NLP) tasks.
They can only incorporate new knowledge through training or supervised fine-tuning processes.
This precise, up-to-date, and private information is typically stored in relational databases.
 arXiv  Detail & Related papers  (2024-07-21T06:19:10Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
 Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
 arXiv  Detail & Related papers  (2024-07-16T04:41:58Z)
- MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [16.836658183451764]
 Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data.
Existing publicly available hardware datasets are often limited in size, complexity, or detail.
We propose a Multi-Grained-Verilog (MG-Verilog) dataset, which encompasses descriptions at various levels of detail and corresponding code samples.
 arXiv  Detail & Related papers  (2024-07-02T03:21:24Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
 Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
 arXiv  Detail & Related papers  (2024-06-19T00:28:58Z)
- UniDM: A Unified Framework for Data Manipulation with Large Language Models [66.61466011795798]
 Large Language Models (LLMs) resolve multiple data manipulation tasks.
LLMs exhibit bright benefits in terms of performance but still require customized designs to fit each specific task.
We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks.
 arXiv  Detail & Related papers  (2024-05-10T14:44:04Z)
- Making Large Language Models Better Data Creators [22.0882632635255]
 Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
 Deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
 arXiv  Detail & Related papers  (2023-10-31T01:08:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     