Knowledge-Centric Templatic Views of Documents
- URL: http://arxiv.org/abs/2401.06945v1
- Date: Sat, 13 Jan 2024 01:22:15 GMT
- Title: Knowledge-Centric Templatic Views of Documents
- Authors: Isabel Cachola, Silviu Cucerzan, Allen Herring, Vuksan Mijovic, Erik
Oveson, Sujay Kumar Jauhar
- Abstract summary: Authors often compose ideas about the same underlying knowledge in different documents and formats.
Prior work in document generation has generally considered the creation of each separate format to be a different task.
This approach is suboptimal for the advancement of AI-supported content authoring from both research and application perspectives.
- Score: 2.8122829028152787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Authors seeking to communicate with broader audiences often compose their
ideas about the same underlying knowledge in different documents and formats --
for example, as slide decks, newsletters, reports, brochures, etc. Prior work
in document generation has generally considered the creation of each separate
format to be a different task, developing independent methods for generation
and evaluation. This approach is suboptimal for the advancement of AI-supported
content authoring from both research and application perspectives because it
leads to fragmented learning processes, redundancy in models and methods, and
disjointed evaluation. Thus, in our work, we consider each of these documents
to be templatic views of the same underlying knowledge, and we aim to unify the
generation and evaluation of these templatic views of documents. We begin by
introducing an LLM-powered method to extract the most important information
from an input document and represent this information in a structured format.
We show that this unified representation can be used to generate multiple
templatic views with no supervision and with very little guidance, improving
over strong baselines. We additionally introduce a unified evaluation method
that is template agnostic, and can be adapted to building document generators
for heterogeneous downstream applications. Finally, we conduct a human
evaluation, which shows that humans prefer 82% of the downstream documents
generated with our method. Furthermore, the newly proposed evaluation metric
correlates more highly with human judgement than prior metrics, while providing
a unified evaluation method.
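The abstract's core idea can be illustrated with a minimal sketch: extract a document's key knowledge once into a structured intermediate representation, then render that single representation into multiple templatic views. All names below (`StructuredDoc`, `render_slide_deck`, `render_newsletter`) are hypothetical, invented for illustration; the paper's actual extraction step is LLM-powered and its representation and templates are more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredDoc:
    """Unified intermediate representation of a document's key knowledge.

    In the paper this is produced by an LLM from an input document;
    here it is filled in by hand for illustration.
    """
    title: str
    key_points: list[str] = field(default_factory=list)

def render_slide_deck(doc: StructuredDoc) -> str:
    """Render the shared representation as a minimal slide-deck outline."""
    lines = [f"# {doc.title}"]
    lines += [f"- Slide: {point}" for point in doc.key_points]
    return "\n".join(lines)

def render_newsletter(doc: StructuredDoc) -> str:
    """Render the same representation as a short newsletter blurb."""
    body = " ".join(doc.key_points)
    return f"{doc.title}\n\n{body}"

# One extraction, two templatic views of the same underlying knowledge.
doc = StructuredDoc(
    title="Quarterly Results",
    key_points=["Revenue grew 12%.", "Two new products launched."],
)
print(render_slide_deck(doc))
print(render_newsletter(doc))
```

The point of the sketch is the decoupling: because every view reads from the same `StructuredDoc`, adding a new format means adding one renderer, not a new end-to-end generation and evaluation pipeline per format.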
Related papers
- CovScore: Evaluation of Multi-Document Abstractive Title Set Generation [16.516381474175986]
CovScore is an automatic reference-less methodology for evaluating thematic title sets.
We propose a novel methodology that decomposes quality into five main metrics along different aspects of evaluation.
arXiv Detail & Related papers (2024-07-24T16:14:15Z) - DocXplain: A Novel Model-Agnostic Explainability Method for Document Image Classification [5.247930659596986]
This paper introduces DocXplain, a novel model-agnostic explainability method specifically designed for generating high interpretability feature attribution maps.
We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics.
To the best of the authors' knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images.
arXiv Detail & Related papers (2024-07-04T10:59:15Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - Evaluating a Methodology for Increasing AI Transparency: A Case Study [8.265282762929509]
Given growing concerns about the potential harms of artificial intelligence, societies have begun to demand more transparency about how AI models and systems are created and used.
To address these concerns, several efforts have proposed documentation templates containing questions to be answered by model developers.
No single template can cover the needs of diverse documentation consumers.
arXiv Detail & Related papers (2022-01-24T20:01:01Z) - Value Retrieval with Arbitrary Queries for Form-like Documents [50.5532781148902]
We propose value retrieval with arbitrary queries for form-like documents.
Our method predicts target value for an arbitrary query based on the understanding of layout and semantics of a form.
We propose a simple document language modeling (simpleDLM) strategy to improve document understanding on large-scale model pre-training.
arXiv Detail & Related papers (2021-12-15T01:12:02Z) - Modeling Endorsement for Multi-Document Abstractive Summarization [10.166639983949887]
A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s).
In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization.
Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents.
arXiv Detail & Related papers (2021-10-15T03:55:42Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Automatic Document Sketching: Generating Drafts from Analogous Texts [44.626645471195495]
We introduce a new task, document sketching, which involves generating entire draft documents for the writer to review and revise.
These drafts are built from sets of documents that overlap in form - sharing large segments of potentially reusable text - while diverging in content.
We investigate the application of weakly supervised methods, including use of a transformer-based mixture of experts, together with reinforcement learning.
arXiv Detail & Related papers (2021-06-14T06:46:06Z) - Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z) - SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.