PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis
- URL: http://arxiv.org/abs/2304.11810v1
- Date: Mon, 24 Apr 2023 03:54:48 GMT
- Title: PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis
- Authors: Shu Wei and Nuo Xu
- Abstract summary: We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets.
Our model is suitable for industrial applications, particularly in multi-language scenarios.
- Score: 6.155943751502232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document layout analysis has a wide range of requirements across various
domains, languages, and business scenarios. However, most current
state-of-the-art algorithms are language-dependent, with architectures that
rely on transformer encoders or language-specific text encoders, such as BERT,
for feature extraction. These approaches are limited in their ability to handle
very long documents due to input sequence length constraints and are closely
tied to language-specific tokenizers. Additionally, training a cross-language
text encoder can be challenging due to the lack of labeled multilingual
document datasets that consider privacy. Furthermore, some layout tasks require
a clean separation between different layout components without overlap, which
can be difficult for image segmentation-based algorithms to achieve. In this
paper, we present Paragraph2Graph, a language-independent graph neural network
(GNN)-based model that achieves competitive results on common document layout
datasets while being adaptable to business scenarios with strict separation.
With only 19.95 million parameters, our model is suitable for industrial
applications, particularly in multi-language scenarios.
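To make the idea concrete, the following is a minimal sketch of GNN-based layout analysis, an illustration under assumed names and features, not the Paragraph2Graph implementation: paragraph boxes become graph nodes carrying purely geometric features, edges connect spatially nearby boxes, and message passing classifies each node into a layout category.

```python
import torch
import torch.nn as nn

class SimpleLayoutGNN(nn.Module):
    """Illustrative GCN-style node classifier for document layout.

    Nodes are paragraph boxes; edges link spatially nearby boxes.
    This is a sketch of the general technique, not the paper's model.
    """

    def __init__(self, in_dim=4, hidden=64, num_classes=5):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, num_classes)

    def propagate(self, x, adj):
        # Mean-aggregate neighbour features (row-normalised adjacency).
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return adj @ x / deg

    def forward(self, x, adj):
        h = torch.relu(self.lin1(self.propagate(x, adj)))
        return self.lin2(self.propagate(h, adj))

# Toy example: 3 paragraph boxes, features = normalised (x0, y0, x1, y1).
boxes = torch.tensor([[0.1, 0.05, 0.9, 0.10],   # header-like box
                      [0.1, 0.15, 0.9, 0.40],   # body paragraph
                      [0.1, 0.45, 0.9, 0.70]])  # body paragraph
# Edges between vertically adjacent boxes, plus self-loops.
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])

model = SimpleLayoutGNN()
logits = model(boxes, adj)    # (3, num_classes) class scores per box
print(logits.argmax(dim=1))   # predicted layout class per paragraph
```

Because the node features in this sketch are purely geometric, no tokenizer or input-length limit enters the pipeline, which mirrors the abstract's language-independence argument.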
Related papers
- GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two modeling techniques to reduce the computation cost of processing multiple glyph images simultaneously.
To support instruction tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser [35.69888780388425]
In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM parser, XFormParser.
XFormParser is anchored on a comprehensive pre-trained language model and innovatively amalgamates entity recognition and relation extraction (RE).
Our framework exhibits markedly improved performance across tasks in both multi-language and zero-shot settings.
arXiv Detail & Related papers (2024-05-27T16:37:17Z) - Text Reading Order in Uncontrolled Conditions by Sparse Graph
Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z) - Entry Separation using a Mixed Visual and Textual Language Model:
Application to 19th century French Trade Directories [18.323615434182553]
A key challenge is to correctly segment what constitutes the basic text regions for the target database.
We propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories.
By injecting special visual tokens, encoding, for instance, indentation or line breaks, into the token stream of the language model used for NER purposes, we can leverage both textual and visual knowledge simultaneously.
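As a rough illustration of this token-injection idea (the marker names and tokenizer choice are assumptions, not the paper's actual vocabulary), layout markers can be registered as extra tokens and interleaved with the text before NER:

```python
from transformers import AutoTokenizer

# Hypothetical visual tokens marking layout events seen in the page image.
VISUAL_TOKENS = ["[INDENT]", "[BREAK]"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokenizer.add_tokens(VISUAL_TOKENS, special_tokens=True)

# A directory entry where indentation and line breaks carry meaning:
# the markers are injected where the layout showed them.
entry = "[INDENT] Dupont, Jean [BREAK] rue de Rivoli, 12"
ids = tokenizer(entry)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# The downstream NER model's embedding matrix must then be resized to
# cover the new tokens, e.g. model.resize_token_embeddings(len(tokenizer)).
```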
arXiv Detail & Related papers (2023-02-17T15:30:44Z) - Generalized Decoding for Pixel, Image, and Language [197.85760901840177]
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
arXiv Detail & Related papers (2022-12-21T18:58:41Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, the architecture the authors settle on can handle inputs of arbitrary length.
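A hedged sketch of one such encoder/decoder pairing, a bidirectional LSTM encoder trained with the CTC objective (dimensions, names, and the dummy data below are illustrative assumptions), shows why nothing in the model fixes the input width:

```python
import torch
import torch.nn as nn

class BiLSTMCTCRecognizer(nn.Module):
    """Illustrative text-line recognizer: BiLSTM encoder + CTC head.

    The recurrence imposes no fixed sequence length, so lines of any
    width can be processed, unlike fixed-context Transformer variants.
    """

    def __init__(self, feat_dim=64, hidden=128, num_chars=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, bidirectional=True,
                               batch_first=True)
        # +1 output class for the CTC blank symbol (index 0 by convention).
        self.head = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, frames):                 # (batch, time, feat_dim)
        enc, _ = self.encoder(frames)
        return self.head(enc).log_softmax(-1)  # (batch, time, chars+1)

model = BiLSTMCTCRecognizer()
frames = torch.randn(2, 75, 64)                # 2 lines, 75 column frames
log_probs = model(frames)

# CTC aligns unsegmented per-frame predictions with target strings.
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 101, (2, 10))       # dummy character indices
loss = ctc(log_probs.permute(1, 0, 2),         # CTC expects (T, N, C)
           targets,
           input_lengths=torch.full((2,), 75, dtype=torch.long),
           target_lengths=torch.full((2,), 10, dtype=torch.long))
print(loss.item())
```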
arXiv Detail & Related papers (2021-04-15T21:43:13Z) - Scalable Cross-lingual Document Similarity through Language-specific
Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
arXiv Detail & Related papers (2020-12-15T10:42:40Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.