Capturing Logical Structure of Visually Structured Documents with
Multimodal Transition Parser
- URL: http://arxiv.org/abs/2105.00150v1
- Date: Sat, 1 May 2021 02:33:50 GMT
- Title: Capturing Logical Structure of Visually Structured Documents with
Multimodal Transition Parser
- Authors: Yuta Koreeda, Christopher D. Manning
- Abstract summary: We propose to formulate the task as the prediction of transition labels between text fragments, which maps the fragments to a tree.
We developed a feature-based machine learning system that fuses visual, textual and semantic cues.
Our system obtained a paragraph boundary detection F1 score of 0.951, significantly better than a popular PDF-to-text tool with an F1 score of 0.739.
- Score: 39.75232199445175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While many NLP papers, tasks and pipelines assume raw, clean texts, many
texts we encounter in the wild are not so clean, with many of them being
visually structured documents (VSDs) such as PDFs. Conventional preprocessing
tools for VSDs have mainly focused on word segmentation and coarse layout analysis,
while fine-grained logical structure analysis (such as identifying paragraph
boundaries and their hierarchies) of VSDs is underexplored. To that end, we
proposed to formulate the task as the prediction of transition labels between
text fragments, which maps the fragments to a tree, and developed a feature-based
machine learning system that fuses visual, textual and semantic cues. Our
system significantly outperformed baselines in identifying different structures
in VSDs. For example, our system obtained a paragraph boundary detection F1
score of 0.951, which is significantly better than a popular PDF-to-text tool
with an F1 score of 0.739.
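As a concrete illustration of the transition formulation above, the following is a minimal sketch of how predicted transition labels between adjacent text fragments can be mapped to a tree. The label set (CONTINUE, SAME, DEEPER, POP) and the stack-based attachment rules are illustrative assumptions, not the paper's exact scheme.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)

def build_tree(fragments: list[str], transitions: list[str]) -> Node:
    """Map N fragments and N-1 transition labels to a tree.

    Hypothetical label set:
      CONTINUE - same paragraph as the previous fragment (merge text)
      DEEPER   - child of the previous fragment
      SAME     - sibling of the previous fragment
      POP      - sibling of the previous fragment's parent
    """
    root = Node("<root>")
    stack = [root, Node(fragments[0])]  # path from root to the current node
    root.children.append(stack[-1])
    for frag, label in zip(fragments[1:], transitions):
        if label == "CONTINUE":
            stack[-1].text += " " + frag  # extend the current paragraph
            continue
        if label == "SAME":
            stack.pop()                   # attach to the same parent
        elif label == "POP":
            stack.pop()                   # climb one level before attaching
            if len(stack) > 1:
                stack.pop()
        # DEEPER: leave the current node on the stack so it becomes the parent
        node = Node(frag)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

A stack suffices here because each label only moves the attachment point relative to the previous fragment; under this scheme, paragraph boundary detection reduces to checking where the predicted label is anything other than CONTINUE.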
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis [52.34110239735265]
We present Text Grouping Adapter (TGA), a module that can enable the utilization of various pre-trained text detectors to learn layout analysis.
Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance.
arXiv Detail & Related papers (2024-05-13T05:48:35Z)
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce MiniSeg, an efficient hierarchical segmentation model that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis [52.01356859448068]
HTS can recognize text in an image and identify its 4-level hierarchical structure: characters, words, lines, and paragraphs.
HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.
arXiv Detail & Related papers (2023-10-25T22:23:54Z)
- DSG: An End-to-End Document Structure Generator [32.040520771901996]
Document Structure Generator (DSG) is a novel system for document parsing that is fully end-to-end trainable.
Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-10-13T14:03:01Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis [6.155943751502232]
We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets.
Our model is suitable for industrial applications, particularly in multi-language scenarios.
arXiv Detail & Related papers (2023-04-24T03:54:48Z)
- VTLayout: Fusion of Visual and Text Features for Document Layout Analysis [5.836306027133707]
Document layout analysis (DLA) has the potential to capture rich information in historical or scientific documents on a large scale.
This paper proposes a VT model fusing the documents' deep visual, shallow visual, and text features to identify category blocks.
On the PubLayNet dataset, the identification capability of the VT model is superior to the most advanced DLA methods, with an F1 score as high as 0.9599.
arXiv Detail & Related papers (2021-08-12T17:12:11Z)
- StrucTexT: Structured Text Understanding with Multi-Modal Transformers [29.540122964399046]
Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence.
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks.
We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts.
arXiv Detail & Related papers (2021-08-06T02:57:07Z)
- Topical Change Detection in Documents via Embeddings of Long Sequences [4.13878392637062]
We formulate the task of text segmentation as an independent supervised prediction task.
By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information.
Unlike previous approaches, which mostly operate at the sentence level, we consistently use a broader context (a minimal sketch of this pairwise formulation follows this list).
arXiv Detail & Related papers (2020-12-07T12:09:37Z)
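To make the last entry's formulation concrete, here is a minimal sketch of segmentation as independent pairwise prediction: for each adjacent pair of paragraphs, an independent decision is made about whether a topic boundary falls between them. The encoder choice, the threshold, and the helper name below are illustrative assumptions; the paper fine-tunes a model on paragraphs of similar sections rather than thresholding an off-the-shelf encoder.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumed stand-in encoder; the paper fine-tunes its own model instead.
model = SentenceTransformer("all-MiniLM-L6-v2")

def predict_boundaries(paragraphs: list[str], threshold: float = 0.5) -> list[int]:
    """Return indices i such that a new segment starts at paragraphs[i]."""
    embeddings = model.encode(paragraphs, convert_to_tensor=True)
    boundaries = []
    for i in range(1, len(paragraphs)):
        # Each adjacent pair is scored independently, matching the
        # "independent supervised prediction" framing of the entry above.
        if cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            boundaries.append(i)
    return boundaries
```

Segment spans can then be read off by splitting the paragraph list at the returned indices.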
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.