Understanding Archives: Towards New Research Interfaces Relying on the Semantic Annotation of Documents
- URL: http://arxiv.org/abs/2403.19201v1
- Date: Thu, 28 Mar 2024 07:55:29 GMT
- Title: Understanding Archives: Towards New Research Interfaces Relying on the Semantic Annotation of Documents
- Authors: Nicolas Gutehrlé, Iana Atanassova
- Abstract summary: We show how the semantic annotation of the textual content of study corpora of archival documents facilitates their exploitation and valorisation.
First, we present a methodological framework for the construction of new interfaces based on textual semantics, then address the current technological obstacles and their potential solutions.
- Score: 0.2302001830524133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The digitisation campaigns carried out by libraries and archives in recent years have facilitated access to documents in their collections. However, exploring and exploiting these documents remain difficult tasks due to the sheer quantity of documents available for consultation. In this article, we show how the semantic annotation of the textual content of study corpora of archival documents facilitates their exploitation and valorisation. First, we present a methodological framework for the construction of new interfaces based on textual semantics, then address the current technological obstacles and their potential solutions. We conclude by presenting a practical case of the application of this framework.
Related papers
- Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embeds domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z)
- DLUE: Benchmarking Document Language Understanding [32.550855843975484]
There is no well-established consensus on how to comprehensively evaluate document understanding abilities.
This paper summarizes four representative abilities, i.e., document classification, document structural analysis, document information extraction, and document transcription.
Under the new evaluation framework, we propose Document Language Understanding Evaluation (DLUE), a new task suite.
arXiv Detail & Related papers (2023-05-16T15:16:24Z)
- Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections [17.332692582748408]
We propose to extend TimeLine Summarization (TLS) methods on archive collections to assist in their studies.
We describe a conceptual framework for an Archive TimeLine Summarization (ATLS) system, which aims to generate informative, readable and interpretable timelines.
arXiv Detail & Related papers (2023-01-31T08:58:47Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Embedding Knowledge for Document Summarization: A Survey [66.76415502727802]
Previous works proved that knowledge-embedded document summarizers excel at generating superior digests.
We propose novel taxonomies to recapitulate knowledge and knowledge embeddings under the document summarization view.
arXiv Detail & Related papers (2022-04-24T04:36:07Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues [0.7425558351422133]
We measure the impact of incorporating visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks.
Our method of segmenting documents based on structural metadata outperforms existing methods on four long-document understanding tasks.
arXiv Detail & Related papers (2021-07-16T21:21:50Z)
- A Survey of Deep Learning Approaches for OCR and Document Understanding [68.65995739708525]
We review different techniques for document understanding for documents written in English.
We consolidate methodologies present in the literature to act as a jumping-off point for researchers exploring this area.
arXiv Detail & Related papers (2020-11-27T03:05:59Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.