VSR: A Unified Framework for Document Layout Analysis combining Vision,
Semantics and Relations
- URL: http://arxiv.org/abs/2105.06220v1
- Date: Thu, 13 May 2021 12:20:30 GMT
- Authors: Peng Zhang and Can Li and Liang Qiao and Zhanzhan Cheng and Shiliang
Pu and Yi Niu and Fei Wu
- Abstract summary: We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Document layout analysis is crucial for understanding document structures. On
this task, vision and semantics of documents, and relations between layout
components contribute to the understanding process. Though many works have been
proposed to exploit the above information, they show unsatisfactory results.
NLP-based methods model layout analysis as a sequence labeling task and show
insufficient capability in layout modeling. CV-based methods model it as a
detection or segmentation task, but suffer from inefficient modality fusion and
a lack of relation modeling between layout components. To address these
limitations, we propose a unified framework
VSR for document layout analysis, combining vision, semantics and relations.
VSR supports both NLP-based and CV-based methods. Specifically, we first
introduce vision through the document image and semantics through text
embedding maps. Then, modality-specific visual and semantic features are
extracted with a two-stream network and adaptively fused to make full use of
their complementary information. Finally, given component candidates, a
relation module based on a graph neural network is incorporated to model
relations between components and produce the final results. On three popular
benchmarks, VSR outperforms previous models by large margins. Code will be
released soon.
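The adaptive fusion step described in the abstract can be pictured as a learned gate that mixes the two modality streams per feature. The following is a minimal NumPy sketch of that idea; the `adaptive_fuse` helper, the gate parameterization, and all shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(vis, sem, w, b):
    """Gate-based fusion of visual and semantic features (illustrative).

    vis, sem: (N, D) modality-specific features for N layout regions.
    A learned gate decides, per feature, how much each modality contributes.
    """
    gate = sigmoid(np.concatenate([vis, sem], axis=1) @ w + b)  # (N, D) in (0, 1)
    return gate * vis + (1.0 - gate) * sem

D = 8
vis = rng.normal(size=(4, D))           # features from the visual stream
sem = rng.normal(size=(4, D))           # features from the semantic stream
w = rng.normal(size=(2 * D, D)) * 0.1   # hypothetical gate weights
b = np.zeros(D)
fused = adaptive_fuse(vis, sem, w, b)
print(fused.shape)  # (4, 8)
```

Because the gate lies strictly in (0, 1), each fused value is a per-element convex combination of the two modality values, so neither stream is ever discarded outright.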
Related papers
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build HGA, a novel hypergraph-attention framework for document semantic entity recognition that attends to entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- DLAFormer: An End-to-End Transformer For Document Layout Analysis [7.057192434574117]
We propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer.
We treat various DLA sub-tasks as relation prediction problems and consolidate these relation prediction labels into a unified label space.
We introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR.
arXiv Detail & Related papers (2024-05-20T03:34:24Z)
- GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding [4.258365032282028]
We present a language-agnostic framework for structured document understanding (DU) that integrates a contrastive learning objective with graph attention networks (GATs).
We propose a novel methodology that combines geometric edge features with visual features in an overall two-stage GAT-based framework.
Our results highlight the model's proficiency in identifying key-value relationships on the FUNSD forms dataset and in discovering spatial relationships in table-structured layouts for RVL-CDIP business invoices.
arXiv Detail & Related papers (2024-05-06T01:40:20Z)
- A Semantic Mention Graph Augmented Model for Document-Level Event Argument Extraction [12.286432133599355]
Document-level Event Argument Extraction (DEAE) aims to identify arguments and their specific roles from an unstructured document.
Advanced approaches to DEAE utilize prompt-based methods to guide pre-trained language models (PLMs) in extracting arguments from input documents.
In this paper, we propose a semantic mention Graph Augmented Model (GAM) to address these problems.
arXiv Detail & Related papers (2024-03-12T08:58:07Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- DocTr: Document Transformer for Structured Information Extraction in Documents [36.1145541816468]
We present a new formulation for structured information extraction from visually rich documents.
It aims to address the limitations of existing IOB tagging or graph-based formulations.
We represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words.
arXiv Detail & Related papers (2023-07-16T02:59:30Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
- Bidirectional Graph Reasoning Network for Panoptic Segmentation [126.06251745669107]
We introduce a Bidirectional Graph Reasoning Network (BGRNet) to mine the intra-modular and inter-modular relations within and between foreground things and background stuff classes.
BGRNet first constructs image-specific graphs in both instance and semantic segmentation branches that enable flexible reasoning at the proposal level and class level.
arXiv Detail & Related papers (2020-04-14T02:32:10Z)
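Several entries above (VSR's relation module, GraphLM, BGRNet) reason over a graph whose nodes are layout components or proposals. A minimal sketch of one mean-aggregation message-passing layer over component candidates; the `message_passing` helper, the hand-built adjacency matrix, and all shapes are illustrative assumptions rather than any listed paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def message_passing(h, adj, w):
    """One graph layer: each component averages its neighbours'
    features, mixes them with its own, then applies a ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                   # avoid divide-by-zero for isolated nodes
    neighbour_mean = (adj @ h) / deg      # (N, D) aggregated messages
    return np.maximum(0.0, (h + neighbour_mean) @ w)

# 5 candidate layout components with 6-d features and a hand-built graph
h = rng.normal(size=(5, 6))
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 0, 1, 0],
                [1, 0, 0, 0, 1],
                [0, 1, 0, 0, 0],
                [0, 0, 1, 0, 0]], dtype=float)
w = rng.normal(size=(6, 6)) * 0.2         # hypothetical layer weights
out = message_passing(h, adj, w)
print(out.shape)  # (5, 6)
```

Stacking a few such layers lets information about one component (say, a caption) reach related components (the figure it describes), which is the kind of relation modeling these papers add on top of per-component features.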
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.