Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction
- URL: http://arxiv.org/abs/2402.13906v2
- Date: Thu, 20 Jun 2024 14:43:51 GMT
- Title: Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction
- Authors: Gili Lior, Yoav Goldberg, Gabriel Stanovsky
- Abstract summary: We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to its respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
- Score: 61.998789448260005
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of documents within a collection, which requires capturing recurring topics across the collection while abstracting over arbitrary header paraphrases, and grounding each topic to its respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. To address these challenges, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
Related papers
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z)
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- Topic Segmentation of Research Article Collections [4.0810783261728565]
We perform topic segmentation of a paper data collection that we crawled and produce a multitopic dataset of roughly seven million paper data records.
We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy.
It is possible to use this newly proposed dataset in two modalities: as a heterogeneous collection of documents from various disciplines or as a set of homogeneous collections, each from a single research topic.
arXiv Detail & Related papers (2022-05-18T15:19:42Z)
- Large-Scale Multi-Document Summarization with Information Extraction and Compression [31.601707033466766]
We develop an abstractive summarization framework independent of labeled data for multiple heterogeneous documents.
Our framework processes documents telling different stories instead of documents on the same topic.
Our experiments demonstrate that our framework outperforms current state-of-the-art methods in this more generic setting.
arXiv Detail & Related papers (2022-05-01T19:49:15Z)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings to represent documents and enforce them to align with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Identifying Documents In-Scope of a Collection from Web Archives [37.34941845795198]
We study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document.
We focus our evaluation on three datasets that we created from three different Web archives.
Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
arXiv Detail & Related papers (2020-09-02T16:22:23Z)