Language Models Enable Simple Systems for Generating Structured Views of
Heterogeneous Data Lakes
- URL: http://arxiv.org/abs/2304.09433v2
- Date: Thu, 20 Apr 2023 04:12:38 GMT
- Authors: Simran Arora and Brandon Yang and Sabri Eyuboglu and Avanika Narayan
and Andrew Hojel and Immanuel Trummer and Christopher Ré
- Abstract summary: EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A long-standing goal of the data management community is to develop general,
automated systems that ingest semi-structured documents and output queryable
tables without human effort or domain-specific customization. Given the sheer
variety of potential documents, state-of-the-art systems make simplifying
assumptions and use domain-specific training. In this work, we ask whether we
can maintain generality by using large language models (LLMs). LLMs, which are
pretrained on broad data, can perform diverse downstream tasks simply
conditioned on natural language task descriptions.
We propose and evaluate EVAPORATE, a simple, prototype system powered by
LLMs. We identify two fundamentally different strategies for implementing this
system: prompt the LLM to directly extract values from documents or prompt the
LLM to synthesize code that performs the extraction. Our evaluations show a
cost-quality tradeoff between these two approaches. Code synthesis is cheap,
but far less accurate than directly processing each document with the LLM. To
improve quality while maintaining low cost, we propose an extended code
synthesis implementation, EVAPORATE-CODE+, which achieves better quality than
direct extraction. Our key insight is to generate many candidate functions and
ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only
outperforms the state-of-the-art systems, but does so using a sublinear pass
over the documents with the LLM. This equates to a 110x reduction in the number
of tokens the LLM needs to process, averaged across 16 real-world evaluation
settings of 10k documents each.
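The abstract's key insight, generating many candidate extraction functions and ensembling their outputs with weak supervision, can be illustrated with a minimal sketch. The extractor functions, regexes, and document formats below are hypothetical stand-ins for LLM-synthesized code, and simple majority voting is used as a simplified proxy for the paper's weak-supervision aggregation:

```python
import re
from collections import Counter

# Candidate extractors stand in for code the LLM would synthesize when
# prompted several times; in practice the candidates vary in quality.
def extract_date_v1(doc):
    # Anchored candidate: looks for an explicit "Date:" field.
    m = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", doc)
    return m.group(1) if m else None

def extract_date_v2(doc):
    # Noisier candidate: grabs the first date-like token anywhere.
    m = re.search(r"\d{4}-\d{2}-\d{2}", doc)
    return m.group(0) if m else None

def extract_date_v3(doc):
    # Broken candidate, as code synthesis sometimes produces.
    return None

def ensemble_extract(doc, candidates):
    """Majority vote over the candidates' extractions (a simplified
    stand-in for the weak-supervision ensembling in EVAPORATE-CODE+)."""
    votes = [f(doc) for f in candidates]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Toy heterogeneous documents; the synthesized functions run over all
# documents so the LLM itself only makes a sublinear pass.
docs = [
    "Title: Report A\nDate: 2023-04-20\nBody: ...",
    "Date: 2023-05-01\nRevised on 2023-06-02",
]
candidates = [extract_date_v1, extract_date_v2, extract_date_v3]
table = {i: ensemble_extract(d, candidates) for i, d in enumerate(docs)}
```

Because the functions are cheap to run, disagreement among candidates is resolved per document rather than per function, so one bad candidate (like `extract_date_v3`) does not corrupt the output table.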
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682]
We introduce a query-dependent parameter-efficient fine-tuning (Q-PEFT) approach for text reranking that leaks query information to Large Language Models (LLMs).
We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues.
We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
arXiv Detail & Related papers (2024-04-06T06:44:41Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
However, their application to extracting information from visually rich documents has not yet been successful.
The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
arXiv Detail & Related papers (2023-09-19T22:32:56Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM.
It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses.
We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.