Magic Markup: Maintaining Document-External Markup with an LLM
- URL: http://arxiv.org/abs/2403.03481v1
- Date: Wed, 6 Mar 2024 05:40:31 GMT
- Title: Magic Markup: Maintaining Document-External Markup with an LLM
- Authors: Edward Misback, Zachary Tatlock, Steven L. Tanimoto
- Abstract summary: We present a system that re-tags modified programs, enabling rich annotations to automatically follow code as it evolves.
Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.
- Score: 1.0538052824177144
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text documents, including programs, typically have human-readable semantic
structure. Historically, programmatic access to these semantics has required
explicit in-document tagging. Especially in systems where the text has an
execution semantics, this means it is an opt-in feature that is hard to support
properly. Today, language models offer a new method: metadata can be bound to
entities in changing text using a model's human-like understanding of
semantics, with no requirements on the document structure. This method expands
the applications of document annotation, a fundamental operation in program
writing, debugging, maintenance, and presentation. We contribute a system that
employs an intelligent agent to re-tag modified programs, enabling rich
annotations to automatically follow code as it evolves. We also contribute a
formal problem definition, an empirical synthetic benchmark suite, and our
benchmark generator. Our system achieves an accuracy of 90% on our benchmarks
and can replace a document's tags in parallel at a rate of 5 seconds per tag.
While there remains significant room for improvement, we find performance
reliable enough to justify further exploration of applications.
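The core operation the abstract describes, re-binding a tag to its entity after the surrounding text changes, can be sketched without the model. Below, a fuzzy string matcher (difflib) stands in for the paper's LLM agent; the `Tag`/`retag` names and the 0.6 threshold are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of document-external tag maintenance: a tag stores an
# annotation plus the text it was anchored to; after an edit, a matcher
# re-locates the most similar region. difflib is a stand-in for the
# paper's LLM-based re-tagger.
import difflib
from dataclasses import dataclass

@dataclass
class Tag:
    note: str      # annotation bound to the code entity
    anchor: str    # the text the tag was attached to

def retag(tag: Tag, new_doc: str):
    """Return the offset of the tag's anchor in the edited document,
    or None if no sufficiently similar region is found."""
    best, best_pos = 0.0, None
    window = len(tag.anchor)
    for i in range(max(1, len(new_doc) - window + 1)):
        ratio = difflib.SequenceMatcher(
            None, tag.anchor, new_doc[i:i + window]).ratio()
        if ratio > best:
            best, best_pos = ratio, i
    return best_pos if best >= 0.6 else None

# The function was renamed's arguments were renamed and a comment added,
# yet the tag follows the definition to its new location.
new = "# utilities\ndef add(x, y):\n    return x + y\n"
tag = Tag(note="unit-tested", anchor="def add(a, b):")
pos = retag(tag, new)
```

A real system would replace the sliding-window comparison with a semantic judgment, which is what lets tags survive renames and refactorings that defeat purely textual matching.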
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]

We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
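The failure mode DAPR quantifies, a passage that cannot answer a query without its surrounding document, can be illustrated with a toy retriever. Plain word overlap stands in for a real scoring model here; the document, passage, and query strings are invented for illustration and are not from the benchmark.

```python
# Toy illustration of the document-context gap: the passage alone does
# not mention the entity the query asks about, but prepending document
# context (here, the title) recovers the match.
def overlap(query: str, text: str) -> int:
    """Score by counting shared lowercase tokens (a stand-in retriever)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

doc_title = "Magic Markup system overview"
passage = "It achieves 90% accuracy on the benchmarks."
query = "Magic Markup accuracy"

bare = overlap(query, passage)                     # passage only
aware = overlap(query, doc_title + " " + passage)  # document-aware
```

The document-aware score is strictly higher because the title supplies the entity name the passage refers to only pronominally, which is the class of error the abstract attributes 53.5% of SoTA retriever failures to.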
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- SelfDocSeg: A Self-Supervised Vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a known problem to the documents research community.
With growing internet connectivity reaching into personal life, an enormous number of documents has become available in the public domain.
We address this challenge using self-supervision, in contrast to the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - MarkupLM: Pre-training of Text and Markup Language for Visually-rich
Document Understanding [35.35388421383703]
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU)
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks.
arXiv Detail & Related papers (2021-10-16T09:17:28Z) - SenTag: a Web-based Tool for Semantic Annotation of Textual Documents [4.910379177401659]
SenTag is a web-based tool focused on semantic annotation of textual documents.
The main goal of the application is two-fold: facilitating the tagging process and reducing or avoiding errors in the output documents.
It is also possible to assess the level of agreement of annotators working on a corpus of text.
arXiv Detail & Related papers (2021-09-16T08:39:33Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z) - Document-Level Definition Detection in Scholarly Documents: Existing
Models, Error Analyses, and Future Directions [40.64025648548128]
We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and filters, and evaluate it on a standard sentence-level benchmark.
HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively.
arXiv Detail & Related papers (2020-10-11T01:16:10Z) - Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z) - SPECTER: Document-level Representation Learning using Citation-informed
Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.