Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
- URL: http://arxiv.org/abs/2304.09433v3
- Date: Fri, 07 Mar 2025 17:33:50 GMT
- Title: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
- Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher RĂ©,
- Abstract summary: EVAPORATE is a prototype system powered by large language models (LLMs)<n>Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.<n>We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
- Score: 54.13559879916708
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A long standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain specific customization. Given the sheer variety of potential documents, state-of-the art systems make simplifying assumptions and use domain specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
Related papers
- A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis [43.746749403268275]
Large Language Models (LLMs) suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures.
We propose a collaborative framework, GRA, that aggregates specialized roles across small LLMs to generate high-quality, diverse, and reliable data.
Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents.
arXiv Detail & Related papers (2025-04-11T06:13:43Z) - From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research [13.818244562506138]
Large Language Models (LLMs) provide a cost-effective and efficient alternative to human annotation.
This paper introduces the SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow.
The workflow integrates established principles of human annotation with systematic prompt optimization and model selection.
arXiv Detail & Related papers (2024-12-19T02:21:41Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z) - Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682]
We introduce a query-dependent parameter efficient fine-tuning (Q-PEFT) approach for text reranking to leak information to Large Language Models (LLMs)
We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues.
We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
arXiv Detail & Related papers (2024-04-06T06:44:41Z) - LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963]
Large Language Models (LLM) have revolutionized Natural Language Processing (NLP)
Their application in extracting information from visually rich documents has not yet been successful.
Main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
arXiv Detail & Related papers (2023-09-19T22:32:56Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.