Timestamping Documents and Beliefs
- URL: http://arxiv.org/abs/2106.14622v1
- Date: Wed, 9 Jun 2021 02:12:18 GMT
- Title: Timestamping Documents and Beliefs
- Authors: Swayambhu Nath Ray
- Abstract summary: Document dating is a challenging problem which requires inference over the temporal structure of the document.
In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach.
We also propose AD3: Attentive Deep Document Dater, an attention-based document dating system.
- Score: 1.4467794332678539
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most of the textual information available to us is temporally variable. In a
world where information is dynamic, time-stamping it is a very important
task. Documents are a good source of information and are used for many tasks
like sentiment analysis, classification of reviews, etc. Knowing the
creation date of a document facilitates several tasks like summarization, event
extraction, temporally focused information extraction, etc. Unfortunately, for
most documents on the web, the time-stamp meta-data is either erroneous
or missing. Document dating is thus a challenging problem which requires
inference over the temporal structure of the document alongside its contextual
information. Prior document dating systems have largely relied
on handcrafted features while ignoring such document-internal structures. In
this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based
document dating approach which jointly exploits the syntactic and temporal graph
structures of a document in a principled way. We also point out some
limitations of NeuralDater and, to utilize both contextual and temporal
information in documents in a more flexible and intuitive manner, propose AD3:
Attentive Deep Document Dater, an attention-based document dating system. To
the best of our knowledge, these are the first applications of deep learning
methods to this task. Through extensive experiments on real-world datasets, we
find that our models outperform state-of-the-art baselines by a
significant margin.
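The abstract describes aggregating information over a document's syntactic and temporal graphs with graph convolutions. As a rough illustration of that building block, here is a minimal sketch of a single GCN layer over a toy token graph; all names, shapes, and the toy graph are illustrative assumptions, not the paper's actual NeuralDater implementation.

```python
# A minimal single-layer GCN sketch: each node (token) embedding is
# updated by averaging over its graph neighbors (plus itself) and
# applying a learned linear map with a ReLU. This only illustrates the
# general GCN mechanism, not the paper's specific model.
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: H' = ReLU(D^-1 (A + I) H W).

    H : (n, d_in)     node (token) embeddings
    A : (n, n)        adjacency matrix of the document graph
    W : (d_in, d_out) learned weight matrix
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # row-normalize by degree
    return np.maximum(D_inv @ A_hat @ H @ W, 0.0)  # ReLU activation

# Toy example: 4 tokens with 8-dim embeddings, chained by edges
# (standing in for, e.g., syntactic dependency links).
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = rng.standard_normal((8, 16))
H2 = gcn_layer(H, A, W)
print(H2.shape)  # (4, 16)
```

In the paper's setting, separate graphs (syntactic dependencies, temporal links between time expressions and events) would each contribute such aggregation steps before a classifier predicts the document's creation year.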
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - DLUE: Benchmarking Document Language Understanding [32.550855843975484]
There is no well-established consensus on how to comprehensively evaluate document understanding abilities.
This paper summarizes four representative abilities, i.e., document classification, document structural analysis, document information extraction, and document transcription.
Under the new evaluation framework, we propose Document Language Understanding Evaluation (DLUE), a new task suite.
arXiv Detail & Related papers (2023-05-16T15:16:24Z) - Cross-Modal Entity Matching for Visually Rich Documents [4.8119678510491815]
Visually rich documents utilize visual cues to augment their semantics.
Existing works that enable structured querying on these documents do not take this into account.
We propose Juno -- a cross-modal entity matching framework to address this limitation.
arXiv Detail & Related papers (2023-03-01T18:26:14Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues [0.7425558351422133]
We measure the impact of incorporating visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks.
Our method of segmenting documents based on structural metadata out-performs existing methods on four long-document understanding tasks.
arXiv Detail & Related papers (2021-07-16T21:21:50Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.