Related papers: The Design of an LLM-powered Unstructured Analytics System

The Design of an LLM-powered Unstructured Analytics System

URL: http://arxiv.org/abs/2409.00847v3
Date: Sat, 28 Dec 2024 05:14:14 GMT
Title: The Design of an LLM-powered Unstructured Analytics System
Authors: Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer, Parth Parmar, Tanvi Ranade, Mehul A. Shah, Benjamin Sowell, Dan Tecuci, Vinayak Thapliyal, Matt Welsh,
Abstract summary: We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design.<n>Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images.<n>We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real world reports from the National Transportation Safety Board (NTSB)
Score: 0.3622961597765815
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.

Related papers

UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis [7.057192434574117]
We propose a unified relation prediction approach for HDSA, called UniHDSA. UniHDSA treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures.
arXiv Detail & Related papers (2025-03-20T06:44:47Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce Omni V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text [0.848663031844483]
This paper identifies the need to develop robust evaluation approaches for natural language generation, wherein references/ground labels doesn't exist or isn't amply available. We show that the critiquing Agent is able to rectify scores from LLM evaluators, thereby reducing the need for labeled data even for complex NLG evaluation scenarios.
arXiv Detail & Related papers (2024-11-25T04:07:16Z)
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [12.555427275787174]
We present DocETL, a system that optimize complex document processing pipelines. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based framework to automatically optimize them. We show that DocETL finds plans with outputs that are $1.34$ to $4.6times$ higher quality than well-engineered baselines.
arXiv Detail & Related papers (2024-10-16T03:22:35Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery [1.6080795642111267]
This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data.
arXiv Detail & Related papers (2024-05-16T13:17:14Z)
IQLS: Framework for leveraging Metadata to enable Large Language Model based queries to complex, versatile Data [0.20482269513546458]
The Intelligent Query and Learning System (IQLS) simplifies the process by allowing natural language use to simplify data retrieval. It maps structured data into a framework based on the available metadata and available data models. The IQLS enables the agent to fulfill tasks given by the user query through interfaces.
arXiv Detail & Related papers (2024-05-04T13:44:05Z)
LLM Based Multi-Agent Generation of Semi-structured Documents from Semantic Templates in the Public Administration Domain [2.3999111269325266]
Large Language Models (LLMs) have enabled the creation of customized text output satisfying user requests. We propose a novel approach that combines the LLMs with prompt engineering and multi-agent systems for generating new documents compliant with a desired structure.
arXiv Detail & Related papers (2024-02-21T13:54:53Z)
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users. We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set. Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. We propose PDFTriage that enables models to retrieve the context based on either structure or content. Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events), information beyond the document text. This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text. We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z)
Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms. Under a deep generative framework, our system jointly optimize a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time. Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.