HADES: Homologous Automated Document Exploration and Summarization
- URL: http://arxiv.org/abs/2302.13099v1
- Date: Sat, 25 Feb 2023 15:16:10 GMT
- Title: HADES: Homologous Automated Document Exploration and Summarization
- Authors: Piotr Wilczyński, Artur Żółkowski, Mateusz Krzyziński,
Emilia Wiśnios, Bartosz Pieliński, Stanisław Giziński, Julian
Sienkiewicz, Przemysław Biecek
- Abstract summary: HADES is designed to streamline the work of professionals dealing with large volumes of documents.
The tool employs a multi-step pipeline that begins with processing PDF documents using topic modeling, summarization, and analysis of the most important words for each topic.
- Score: 3.3509104620016092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces HADES, a novel tool for automatic comparative
analysis of documents with similar structures. HADES is designed to streamline the work of
professionals dealing with large volumes of documents, such as policy
documents, legal acts, and scientific papers. The tool employs a multi-step
pipeline that begins with processing PDF documents using topic modeling,
summarization, and analysis of the most important words for each topic. The
process concludes with an interactive web app with visualizations that
facilitate the comparison of the documents. HADES has the potential to
significantly improve the productivity of professionals dealing with high
volumes of documents, reducing the time and effort required to complete tasks
related to comparative document analysis. Our package is publicly available
on GitHub.
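
As a rough illustration of such a pipeline, the sketch below extracts text from PDF documents, fits a topic model, and reports the most important words for each topic. It is a minimal, assumption-laden example rather than HADES's actual API: the library choices (pypdf, scikit-learn's LDA) and the file names are illustrative only.

```python
# Minimal sketch of a comparable document-exploration pipeline
# (illustrative only; this is NOT HADES's actual API).
from pypdf import PdfReader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def pdf_to_text(path: str) -> str:
    """Concatenate the extracted text of every page in a PDF file."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def top_words_per_topic(paths: list[str], n_topics: int = 5, n_words: int = 10):
    """Fit an LDA topic model over the documents and return the top words per topic."""
    texts = [pdf_to_text(p) for p in paths]
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [
        [vocab[i] for i in topic.argsort()[::-1][:n_words]]
        for topic in lda.components_
    ]


if __name__ == "__main__":
    # Hypothetical input files for illustration.
    for k, words in enumerate(top_words_per_topic(["policy_a.pdf", "policy_b.pdf"])):
        print(f"Topic {k}: {', '.join(words)}")
```

The per-topic word lists produced by such a model correspond to the "most important words for each topic" step described in the abstract; summarization and the interactive comparison app would build on top of these intermediate results.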
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z) - Functional Analytics for Document Ordering for Curriculum Development
and Comprehension [0.0]
We propose techniques for automatic document order generation for curriculum development and for creation of optimal reading order for use in learning, training, and other content-sequencing applications.
Such techniques could potentially be used to improve comprehension, identify areas that need expounding, generate curricula, and improve search engine results.
arXiv Detail & Related papers (2023-11-22T02:13:27Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - Multi-Vector Models with Textual Guidance for Fine-Grained Scientific
Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z) - The Law of Large Documents: Understanding the Structure of Legal
Contracts Using Visual Cues [0.7425558351422133]
We measure the impact of incorporating visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks.
Our method of segmenting documents based on structural metadata outperforms existing methods on four long-document understanding tasks.
arXiv Detail & Related papers (2021-07-16T21:21:50Z) - Automatic Document Sketching: Generating Drafts from Analogous Texts [44.626645471195495]
We introduce a new task, document sketching, which involves generating entire draft documents for the writer to review and revise.
These drafts are built from sets of documents that overlap in form - sharing large segments of potentially reusable text - while diverging in content.
We investigate the application of weakly supervised methods, including use of a transformer-based mixture of experts, together with reinforcement learning.
arXiv Detail & Related papers (2021-06-14T06:46:06Z) - Towards a Multi-modal, Multi-task Learning based Pre-training Framework
for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z) - Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)