Magic Markup: Maintaining Document-External Markup with an LLM
- URL: http://arxiv.org/abs/2403.03481v1
- Date: Wed, 6 Mar 2024 05:40:31 GMT
- Title: Magic Markup: Maintaining Document-External Markup with an LLM
- Authors: Edward Misback, Zachary Tatlock, Steven L. Tanimoto
- Abstract summary: We present a system that re-tags modified programs, enabling rich annotations to automatically follow code as it evolves.
Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.
- Score: 1.0538052824177144
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text documents, including programs, typically have human-readable semantic
structure. Historically, programmatic access to these semantics has required
explicit in-document tagging. Especially in systems where the text has an
execution semantics, this means it is an opt-in feature that is hard to support
properly. Today, language models offer a new method: metadata can be bound to
entities in changing text using a model's human-like understanding of
semantics, with no requirements on the document structure. This method expands
the applications of document annotation, a fundamental operation in program
writing, debugging, maintenance, and presentation. We contribute a system that
employs an intelligent agent to re-tag modified programs, enabling rich
annotations to automatically follow code as it evolves. We also contribute a
formal problem definition, an empirical synthetic benchmark suite, and our
benchmark generator. Our system achieves an accuracy of 90% on our benchmarks
and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance
reliable enough to justify further exploration of applications.
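The core operation the abstract describes, re-binding a tag to its entity after the surrounding text changes, can be sketched without the model. Below, a fuzzy string matcher (difflib) stands in for the paper's LLM agent; the `Tag`/`retag` names and the 0.6 threshold are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of document-external tag maintenance: a tag stores an
# annotation plus the text it was anchored to; after an edit, a matcher
# re-locates the most similar region. difflib is a stand-in for the
# paper's LLM-based re-tagger.
import difflib
from dataclasses import dataclass

@dataclass
class Tag:
    note: str      # annotation bound to the code entity
    anchor: str    # the text the tag was attached to

def retag(tag: Tag, new_doc: str):
    """Return the offset of the tag's anchor in the edited document,
    or None if no sufficiently similar region is found."""
    best, best_pos = 0.0, None
    window = len(tag.anchor)
    for i in range(max(1, len(new_doc) - window + 1)):
        ratio = difflib.SequenceMatcher(
            None, tag.anchor, new_doc[i:i + window]).ratio()
        if ratio > best:
            best, best_pos = ratio, i
    return best_pos if best >= 0.6 else None

# The function was renamed's arguments were renamed and a comment added,
# yet the tag follows the definition to its new location.
new = "# utilities\ndef add(x, y):\n    return x + y\n"
tag = Tag(note="unit-tested", anchor="def add(a, b):")
pos = retag(tag, new)
```

A real system would replace the sliding-window comparison with a semantic judgment, which is what lets tags survive renames and refactorings that defeat purely textual matching.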
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]

We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
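The failure mode DAPR quantifies, a passage that cannot answer a query without its surrounding document, can be illustrated with a toy retriever. Plain word overlap stands in for a real scoring model here; the document, passage, and query strings are invented for illustration and are not from the benchmark.

```python
# Toy illustration of the document-context gap: the passage alone does
# not mention the entity the query asks about, but prepending document
# context (here, the title) recovers the match.
def overlap(query: str, text: str) -> int:
    """Score by counting shared lowercase tokens (a stand-in retriever)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

doc_title = "Magic Markup system overview"
passage = "It achieves 90% accuracy on the benchmarks."
query = "Magic Markup accuracy"

bare = overlap(query, passage)                     # passage only
aware = overlap(query, doc_title + " " + passage)  # document-aware
```

The document-aware score is strictly higher because the title supplies the entity name the passage refers to only pronominally, which is the class of error the abstract attributes 53.5% of SoTA retriever failures to.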
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- SelfDocSeg: A Self-Supervised Vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a known problem to the documents research community.
With growing internet connectivity reaching into personal life, an enormous number of documents has become available in the public domain.
We address this challenge using self-supervision, in contrast to the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - MarkupLM: Pre-training of Text and Markup Language for Visually-rich
Document Understanding [35.35388421383703]
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU)
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks.
arXiv Detail & Related papers (2021-10-16T09:17:28Z) - SenTag: a Web-based Tool for Semantic Annotation of Textual Documents [4.910379177401659]
SenTag is a web-based tool focused on semantic annotation of textual documents.
The main goal of the application is two-fold: facilitating the tagging process and reducing or avoiding errors in the output documents.
It is also possible to assess the level of agreement of annotators working on a corpus of text.
arXiv Detail & Related papers (2021-09-16T08:39:33Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z) - Document-Level Definition Detection in Scholarly Documents: Existing
Models, Error Analyses, and Future Directions [40.64025648548128]
We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and filters, and evaluate it on a standard sentence-level benchmark.
HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively.
arXiv Detail & Related papers (2020-10-11T01:16:10Z) - Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z) - SPECTER: Document-level Representation Learning using Citation-informed
Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.