CED: Catalog Extraction from Documents
- URL: http://arxiv.org/abs/2304.14662v1
- Date: Fri, 28 Apr 2023 07:32:00 GMT
- Title: CED: Catalog Extraction from Documents
- Authors: Tong Zhu, Guoliang Zhang, Zechang Li, Zijian Yu, Junfei Ren, Mengsong
Wu, Zhefeng Wang, Baoxing Huai, Pingfu Chao, Wenliang Chen
- Abstract summary: We propose a transition-based framework for parsing documents into catalog trees.
We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents.
- Score: 12.037861186708799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentence-by-sentence information extraction from long documents is an
exhausting and error-prone task. As the indicator of document skeleton,
catalogs naturally chunk documents into segments and provide informative
cascade semantics, which can help to reduce the search space. Despite their
usefulness, catalogs are hard to extract without the assistance of external
knowledge. For documents that adhere to a specific template, regular
expressions are practical to extract catalogs. However, handcrafted heuristics
are not applicable when processing documents from different sources with
diverse formats. To address this problem, we build a large manually annotated
corpus, which is the first dataset for the Catalog Extraction from Documents
(CED) task. Based on this corpus, we propose a transition-based framework for
parsing documents into catalog trees. The experimental results demonstrate that
our proposed method outperforms baseline systems and shows a good ability to
transfer. We believe the CED task could fill the gap between raw text segments
and information extraction tasks on extremely long documents. Data and code are
available at https://github.com/Spico197/CatalogExtraction
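To make the idea of transition-based parsing into a catalog tree more concrete, here is a minimal sketch. It is not the authors' implementation: the action names (`sub_heading`, `sub_text`, `concat`, `reduce`), the `Node` class, and the helper functions are hypothetical, chosen only for illustration. In a real system a trained classifier would predict each action from the text; here the actions are supplied by hand.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Node:
    text: str
    kind: str  # "root", "heading", or "text"
    children: List["Node"] = field(default_factory=list)


def build_catalog_tree(segments: Iterable[str], actions: Iterable[str]) -> Node:
    """Apply a sequence of (hypothetical) transition actions to text segments."""
    root = Node("ROOT", "root")
    stack = [root]  # path from the root to the heading currently being filled
    last = root     # most recently created node, the target of "concat"
    queue = iter(segments)
    for action in actions:
        if action == "reduce":
            # Close the current heading; this action consumes no segment.
            if len(stack) == 1:
                raise ValueError("cannot reduce the root")
            stack.pop()
            continue
        segment = next(queue)
        if action == "sub_heading":
            # Attach a new heading under the current one and descend into it.
            node = Node(segment, "heading")
            stack[-1].children.append(node)
            stack.append(node)
            last = node
        elif action == "sub_text":
            # Attach a body-text leaf under the current heading.
            node = Node(segment, "text")
            stack[-1].children.append(node)
            last = node
        elif action == "concat":
            # Merge a segment that was split off from the previous node.
            if last is root:
                raise ValueError("nothing to concatenate to")
            last.text += segment
        else:
            raise ValueError(f"unknown action: {action}")
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the headings of the tree as an indented outline."""
    lines = []
    for child in node.children:
        if child.kind == "heading":
            lines.append("  " * depth + child.text)
            lines.extend(outline(child, depth + 1))
    return lines


segments = ["1 Overview", "This report covers ", "fiscal year 2022.",
            "1.1 Scope", "Scope details.", "2 Results", "Revenue grew."]
actions = ["sub_heading", "sub_text", "concat", "sub_heading", "sub_text",
           "reduce", "reduce", "sub_heading", "sub_text"]
print("\n".join(outline(build_catalog_tree(segments, actions))))
```

The stack holds the path from the root to the heading currently being filled, so each segment only needs a local decision about how it relates to that heading. This is one reason a transition-based formulation can suit very long documents.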
Related papers
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z)
- The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues [0.7425558351422133]
We measure the impact of incorporating visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks.
Our method of segmenting documents based on structural metadata out-performs existing methods on four long-document understanding tasks.
arXiv Detail & Related papers (2021-07-16T21:21:50Z)
- Doc2Dict: Information Extraction as Text Generation [0.0]
Doc2Dict is a pipeline for extracting document-level information.
We train a language model on existing database records to generate structured spans.
We use checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single baseline.
arXiv Detail & Related papers (2021-05-16T20:46:29Z)
- Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application [21.270184491603864]
We develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position of the current tree.
Experiments are based on thousands of long documents from Chinese and English financial markets and English scientific publications.
We show that logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task.
arXiv Detail & Related papers (2021-05-14T06:26:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.