Language Models Enable Simple Systems for Generating Structured Views of
Heterogeneous Data Lakes
- URL: http://arxiv.org/abs/2304.09433v2
- Date: Thu, 20 Apr 2023 04:12:38 GMT
- Title: Language Models Enable Simple Systems for Generating Structured Views of
Heterogeneous Data Lakes
- Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré
- Abstract summary: EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
- Score: 15.214583657626697
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A long-standing goal of the data management community is to develop general,
automated systems that ingest semi-structured documents and output queryable
tables without human effort or domain-specific customization. Given the sheer
variety of potential documents, state-of-the-art systems make simplifying
assumptions and use domain-specific training. In this work, we ask whether we
can maintain generality by using large language models (LLMs). LLMs, which are
pretrained on broad data, can perform diverse downstream tasks simply
conditioned on natural language task descriptions.
We propose and evaluate EVAPORATE, a simple prototype system powered by
LLMs. We identify two fundamentally different strategies for implementing this
system: prompt the LLM to directly extract values from documents or prompt the
LLM to synthesize code that performs the extraction. Our evaluations show a
cost-quality tradeoff between these two approaches. Code synthesis is cheap,
but far less accurate than directly processing each document with the LLM. To
improve quality while maintaining low cost, we propose an extended code
synthesis implementation, EVAPORATE-CODE+, which achieves better quality than
direct extraction. Our key insight is to generate many candidate functions and
ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only
outperforms the state-of-the-art systems, but does so using a sublinear pass
over the documents with the LLM. This equates to a 110x reduction in the number
of tokens the LLM needs to process, averaged across 16 real-world evaluation
settings of 10k documents each.
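
The two strategies and the ensembling idea above can be made concrete with a short sketch. The following is a hypothetical illustration of the abstract's description, not the authors' implementation: `llm_complete` is a placeholder for any LLM completion API, the prompts are invented, and a simple plurality vote stands in for the paper's weak-supervision aggregation of candidate functions.

```python
from collections import Counter


def llm_complete(prompt: str) -> str:
    """Placeholder for any LLM completion call (hypothetical, not the paper's code)."""
    raise NotImplementedError


# Strategy 1: direct extraction -- one LLM call per (document, attribute) pair.
def direct_extract(document: str, attribute: str) -> str:
    prompt = (
        f"Document:\n{document}\n\n"
        f"Extract the value of the attribute '{attribute}'. Answer with the value only."
    )
    return llm_complete(prompt).strip()


# Strategy 2: code synthesis -- ask the LLM once (on a few sample documents) for a
# reusable extraction function, then run that cheap function over all documents.
def synthesize_extractor(sample_docs: list[str], attribute: str) -> str:
    prompt = (
        "Write a Python function extract(text: str) -> str that returns the value of "
        f"the attribute '{attribute}' from documents formatted like these:\n\n"
        + "\n---\n".join(sample_docs)
    )
    return llm_complete(prompt)  # returns candidate function source code


# EVAPORATE-CODE+-style ensembling (sketch): run many synthesized candidates per
# document and aggregate their outputs, here with a plurality vote as a stand-in
# for weak-supervision-based aggregation.
def ensemble_extract(candidate_fns, document: str) -> str:
    votes = Counter()
    for fn in candidate_fns:
        try:
            votes[fn(document)] += 1
        except Exception:
            continue  # noisy or broken candidates are expected; skip failures
    return votes.most_common(1)[0][0] if votes else ""
```

Because the LLM is only needed to synthesize a handful of candidate functions from a small document sample, the per-document work is plain Python, which is where the sublinear LLM pass and the reported 110x token reduction come from.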
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067] (2024-11-07)
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
- zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321] (2024-09-23)
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821] (2024-07-16)
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
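
As a rough illustration of the self-synthesis step described above, the sketch below generates task-specific input-output pairs by prompting the student model itself. The `llm_complete` placeholder, the prompt wording, and the output file are assumptions for illustration, not SELF-GUIDE's actual pipeline; the filtering and finetuning stages are only noted in comments.

```python
import json


def llm_complete(prompt: str) -> str:
    """Placeholder for a completion call to the student LLM (hypothetical)."""
    raise NotImplementedError


def synthesize_pairs(task_instruction: str, n: int) -> list[dict]:
    """Prompt the student model to invent inputs, then answer them itself."""
    pairs = []
    for _ in range(n):
        x = llm_complete(
            f"Task: {task_instruction}\nWrite one new, realistic input for this task:"
        ).strip()
        y = llm_complete(f"Task: {task_instruction}\nInput: {x}\nOutput:").strip()
        pairs.append({"instruction": task_instruction, "input": x, "output": y})
    return pairs


if __name__ == "__main__":
    # The synthesized pairs would then be filtered (e.g. dropping malformed or
    # duplicate examples) and used to finetune the same student model.
    data = synthesize_pairs(
        "Classify the sentiment of a movie review as positive or negative.", n=100
    )
    with open("self_synth.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps(ex) + "\n")
```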
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898] (2024-04-30)
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477] (2024-04-29)
We propose PromptReps, which combines the advantages of both prompting-based and training-based retrieval approaches: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
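
A hedged sketch of how a single prompted forward pass can yield both representation types mentioned above: the last-token hidden state serves as the dense embedding and the top-k next-token logits as a sparse bag of words. The prompt wording, the `gpt2` stand-in model, and the top-k sparsification are assumptions, not PromptReps' exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small stand-in; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()


def represent(passage: str, top_k: int = 64):
    prompt = f'Passage: "{passage}"\nUse one word to represent the passage:'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -1]       # last-token hidden state -> dense embedding
    logits = out.logits[0, -1]                 # next-token scores over the vocabulary
    weights, ids = torch.topk(logits, top_k)   # keep top-k terms -> sparse bag of words
    sparse = {tok.decode(int(i)).strip(): float(w) for i, w in zip(ids, weights)}
    return dense, sparse
```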
- Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066] (2024-04-23)
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
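
The basic setup such approaches evaluate is easy to sketch: feed the staged diff to an LLM with a short instruction. The prompt and the `llm_complete` placeholder below are illustrative assumptions, not any specific system from the study.

```python
import subprocess


def llm_complete(prompt: str) -> str:
    """Placeholder for any LLM chat/completion API (hypothetical)."""
    raise NotImplementedError


def generate_commit_message() -> str:
    # Collect the staged changes from the working repository.
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True, check=True
    ).stdout
    prompt = (
        "Write a concise, imperative-mood commit message (subject line of at most "
        "72 characters, optional body) for the following diff:\n\n" + diff
    )
    return llm_complete(prompt).strip()
```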
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153] (2024-04-08)
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
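
The encode/decode idea described above can be sketched as two prompting steps: compress a seed instruction into metadata keywords, then expand that metadata into a new tailored instruction. The prompts and the `llm_complete` placeholder are assumptions; Self-Rubrics and Contrastive Filtering are omitted from the sketch.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for a call to a strong LLM used for encoding/decoding (hypothetical)."""
    raise NotImplementedError


def encode_metadata(seed_instruction: str) -> str:
    """Compress a seed instruction into concise keywords (use case, required skills)."""
    return llm_complete(
        "Summarize the use case and the skills required to answer this instruction "
        "as a few comma-separated keywords:\n" + seed_instruction
    )


def decode_instruction(metadata: str) -> str:
    """Expand metadata keywords back into a new, tailored instruction."""
    return llm_complete(
        f"Use case and skills: {metadata}\n"
        "Write one new, challenging instruction that matches this use case and these skills:"
    )
```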
- Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682] (2024-04-06)
We introduce a query-dependent parameter-efficient fine-tuning (Q-PEFT) approach for text reranking that leaks query information to Large Language Models (LLMs).
We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues.
We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
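
The "top-k tokens as contextual clues" step lends itself to a small sketch: score document tokens by embedding similarity to the query and keep the best k as extra context. The encoder choice and scoring below are assumptions for illustration only; the parameter-efficient fine-tuning and the reranking itself are not shown.

```python
import torch
from transformers import AutoModel, AutoTokenizer

ENCODER = "bert-base-uncased"  # stand-in encoder for the sketch
tok = AutoTokenizer.from_pretrained(ENCODER)
enc = AutoModel.from_pretrained(ENCODER).eval()


def top_k_clues(query: str, document: str, k: int = 10) -> list[str]:
    with torch.no_grad():
        q = enc(**tok(query, return_tensors="pt")).last_hidden_state.mean(dim=1)  # (1, d)
        doc_inputs = tok(document, return_tensors="pt", truncation=True)
        d = enc(**doc_inputs).last_hidden_state[0]                                # (T, d)
    scores = torch.nn.functional.cosine_similarity(d, q, dim=-1)                  # (T,)
    top = scores.topk(min(k, d.shape[0])).indices
    ids = doc_inputs["input_ids"][0][top]
    return [tok.decode(int(i)) for i in ids]
```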
- LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963] (2023-09-19)
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
However, their application to extracting information from visually rich documents has not yet been successful.
The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742] (2023-05-19)
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
This list is automatically generated from the titles and abstracts of the papers listed on this site.