Related papers: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

URL: http://arxiv.org/abs/2304.09433v3
Date: Fri, 07 Mar 2025 17:33:50 GMT
Title: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré,
Abstract summary: EVAPORATE is a prototype system powered by large language models (LLMs)<n>Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.<n>We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
Score: 54.13559879916708
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: A long standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain specific customization. Given the sheer variety of potential documents, state-of-the art systems make simplifying assumptions and use domain specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.

Related papers

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations.<n>We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings.<n>By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z)
SHERPA: A Model-Driven Framework for Large Language Model Execution [5.595920727790909]
SHERPA is a model-driven framework to improve the performance of large language models (LLMs) on complex tasks.<n>By explicitly incorporating domain-specific best practices into hierarchical state machines, SHERPA enables more fine-grained control over their behavior.
arXiv Detail & Related papers (2025-08-29T23:14:21Z)
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [28.47810405584841]
Arranged and Organized Extraction Benchmark designed to evaluate ability of large language models to comprehend fragmented documents.<n>AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries.<n>Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis [43.746749403268275]
Large Language Models (LLMs) suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. We propose a collaborative framework, GRA, that aggregates specialized roles across small LLMs to generate high-quality, diverse, and reliable data. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents.
arXiv Detail & Related papers (2025-04-11T06:13:43Z)
From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research [13.818244562506138]
Large Language Models (LLMs) provide a cost-effective and efficient alternative to human annotation. This paper introduces the SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection.
arXiv Detail & Related papers (2024-12-19T02:21:41Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
PECAN: LLM-Guided Dynamic Progress Control with Attention-Guided Hierarchical Weighted Graph for Long-Document QA [16.945257645760428]
Long-document QA presents challenges with large-scale text and long-distance dependencies.<n>Recent advances in Large Language Models (LLMs) enable entire documents to be processed in a single pass.<n>We propose a novel retrieval method that constructs a hierarchical graph where edges are derived by LLM attention weights.
arXiv Detail & Related papers (2024-10-07T07:02:09Z)
zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning. We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus. The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs. This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682]
We introduce a query-dependent parameter efficient fine-tuning (Q-PEFT) approach for text reranking to leak information to Large Language Models (LLMs) We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues. We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
arXiv Detail & Related papers (2024-04-06T06:44:41Z)
LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963]
Large Language Models (LLM) have revolutionized Natural Language Processing (NLP) Their application in extracting information from visually rich documents has not yet been successful. Main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
arXiv Detail & Related papers (2023-09-19T22:32:56Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.