Vision Grid Transformer for Document Layout Analysis
- URL: http://arxiv.org/abs/2308.14978v1
- Date: Tue, 29 Aug 2023 02:09:56 GMT
- Title: Vision Grid Transformer for Document Layout Analysis
- Authors: Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao
- Abstract summary: We present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding.
Experimental results show that the proposed VGT model achieves new state-of-the-art results on document layout analysis tasks.
- Score: 26.62857594455592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document pre-trained models and grid-based models have proven to be very
effective on various tasks in Document AI. However, for the document layout
analysis (DLA) task, existing document pre-trained models, even those
pre-trained in a multi-modal fashion, usually rely on either textual features
or visual features. Grid-based models for DLA are multi-modal but largely
neglect the effect of pre-training. To fully leverage multi-modal information
and exploit pre-training techniques to learn better representation for DLA, in
this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid
Transformer (GiT) is proposed and pre-trained for 2D token-level and
segment-level semantic understanding. Furthermore, a new dataset named D$^4$LA,
which is so far the most diverse and detailed manually-annotated benchmark for
document layout analysis, is curated and released. Experimental results show
that the proposed VGT model achieves new state-of-the-art results
on DLA tasks, e.g. PubLayNet ($95.7\% \rightarrow 96.2\%$), DocBank
($79.6\% \rightarrow 84.1\%$), and D$^4$LA ($67.7\% \rightarrow 68.8\%$).
The code and models, as well as the D$^4$LA dataset, will be made publicly
available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
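The abstract describes VGT as a two-stream model: a vision stream over the page image and a pre-trained Grid Transformer (GiT) over a 2D grid built from OCR tokens. The sketch below is a minimal, hypothetical PyTorch illustration of that two-stream idea only; the token-id grid input, the additive feature fusion, the module sizes, and the toy per-cell classification head are all assumptions, not the authors' implementation.

```python
# Hypothetical two-stream sketch of the VGT idea from the abstract: a vision
# stream over the page image plus a Grid Transformer (GiT) stream over a 2D
# grid of OCR token ids, fused into a shared feature map. All sizes and the
# additive fusion / per-cell head are illustrative assumptions.
import torch
import torch.nn as nn


class GridTransformer(nn.Module):
    """GiT stream: embeds an (Hg x Wg) grid of token ids and runs a standard
    TransformerEncoder over the flattened grid cells."""

    def __init__(self, vocab_size=30522, dim=256, depth=4, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, grid_ids):                         # (B, Hg, Wg) token ids
        b, hg, wg = grid_ids.shape
        x = self.tok_emb(grid_ids).flatten(1, 2)         # (B, Hg*Wg, dim)
        x = self.encoder(x)
        return x.transpose(1, 2).reshape(b, -1, hg, wg)  # (B, dim, Hg, Wg)


class VisionStream(nn.Module):
    """Vision stream: a small conv stem standing in for a pre-trained image
    backbone; it downsamples the page image to a feature map."""

    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, image):                            # (B, 3, H, W)
        return self.stem(image)                          # (B, dim, H/16, W/16)


class TwoStreamSketch(nn.Module):
    """Fuses the two streams and predicts per-cell layout-category logits as a
    stand-in for a full detection head."""

    def __init__(self, num_classes=5, dim=256):
        super().__init__()
        self.vision = VisionStream(dim)
        self.grid = GridTransformer(dim=dim)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, image, grid_ids):
        v = self.vision(image)                           # (B, dim, h, w)
        g = self.grid(grid_ids)
        g = nn.functional.interpolate(g, size=v.shape[-2:], mode="bilinear")
        return self.head(v + g)                          # simple additive fusion


if __name__ == "__main__":
    model = TwoStreamSketch()
    page = torch.randn(2, 3, 224, 224)                   # dummy page images
    grid = torch.randint(0, 30522, (2, 14, 14))          # dummy OCR token grid
    print(model(page, grid).shape)                       # torch.Size([2, 5, 14, 14])
```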
Related papers
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is embedded directly as an image using a VLM and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z)
- DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights [8.139817615390147]
This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework.
DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling.
arXiv Detail & Related papers (2024-10-02T14:47:55Z)
- DLAFormer: An End-to-End Transformer For Document Layout Analysis [7.057192434574117]
We propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer.
We treat various DLA sub-tasks as relation prediction problems and consolidate these relation prediction labels into a unified label space.
We introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR.
arXiv Detail & Related papers (2024-05-20T03:34:24Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- A Graphical Approach to Document Layout Analysis [2.5108258530670606]
Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document.
Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs.
We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network.
arXiv Detail & Related papers (2023-08-03T21:09:59Z)
- M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis [23.924144353511984]
This paper introduces a large and diverse document layout analysis dataset called M$^{6}$Doc.
We propose a transformer-based document layout analysis method called TransDLANet.
We conduct a comprehensive evaluation of M$^{6}$Doc with various layout analysis methods and demonstrate its effectiveness.
arXiv Detail & Related papers (2023-05-15T15:29:06Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- DiT: Self-supervised Pre-training for Document Image Transformer [85.78807512344463]
We propose DiT, a self-supervised pre-trained Document Image Transformer model.
We leverage DiT as the backbone network in a variety of vision-based Document AI tasks.
Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-03-04T15:34:46Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)