BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents
- URL: http://arxiv.org/abs/2108.04539v1
- Date: Tue, 10 Aug 2021 09:30:23 GMT
- Title: BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents
- Authors: Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park
- Abstract summary: This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information included in individual text blocks and their layouts.
BROS encodes spatial information by utilizing relative positions and learns dependencies between OCR blocks with a novel area-masking strategy.
- Score: 13.293166441041238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding documents from their visual snapshots is an emerging problem
that requires both advanced computer vision and NLP methods. The recent advance
in OCR enables the accurate recognition of text blocks, yet it is still
challenging to extract key information from documents due to the diversity of
their layouts. Although recent studies on pre-trained language models show the
importance of incorporating layout information for this task, the combination of
text and layout still follows the style of BERT, which is optimized for
understanding 1D text. This implies there is room for further improvement
considering the 2D nature of text layouts. This paper introduces a pre-trained
language model, BERT Relying On Spatiality (BROS), which effectively utilizes
the information included in individual text blocks and their layouts.
Specifically, BROS encodes spatial information by utilizing relative positions
and learns spatial dependencies between OCR blocks with a novel area-masking
strategy. These two novel approaches lead to an efficient encoding of spatial
layout information highlighted by the robust performance of BROS under
low-resource environments. We also introduce a general-purpose parser that can
be combined with BROS to extract key information even when there is no order
information between text blocks. BROS shows its superiority on four public
benchmarks (FUNSD, SROIE*, CORD, and SciTSR) and its robustness in practical
cases where order information of text blocks is not available. Further
experiments with a varying number of training examples demonstrate the high
training efficiency of our approach. Our code will be open to the public.
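To make the abstract's two ideas concrete, here is a minimal PyTorch sketch of relative 2D position encoding and area masking. The module names, the MLP parameterization of the attention bias, and the box-expansion factor are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Turns pairwise (dx, dy) offsets between text-block centers into
    per-head attention biases, so attention depends on relative 2D layout."""
    def __init__(self, num_heads: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_heads))

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 4) normalized [x0, y0, x1, y1] per OCR block
        centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)
        rel = centers[None, :, :] - centers[:, None, :]  # (N, N, 2) offsets
        return self.mlp(rel).permute(2, 0, 1)            # (heads, N, N) bias

def area_mask(boxes: torch.Tensor, expand: float = 1.5) -> torch.Tensor:
    """Area masking: expand a randomly chosen block's box and mask every
    block whose center falls inside the expanded region, forcing the model
    to recover a spatially contiguous area rather than isolated tokens."""
    anchor = boxes[torch.randint(len(boxes), (1,))].squeeze(0)
    cx, cy = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    w = (anchor[2] - anchor[0]) * expand
    h = (anchor[3] - anchor[1]) * expand
    cxs = (boxes[:, 0] + boxes[:, 2]) / 2
    cys = (boxes[:, 1] + boxes[:, 3]) / 2
    return ((cxs - cx).abs() <= w / 2) & ((cys - cy).abs() <= h / 2)
```

In this reading, the returned bias is added to the attention logits of each layer, and the boolean mask selects the blocks whose text is replaced during pre-training.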
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
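A minimal sketch of the neighbor-aware contrastive objective described above: an InfoNCE-style loss computed over batches built from a seed document and its corpus neighbors, so the in-batch negatives are hard ones. The batch construction and neighbor mining are assumed details, not the paper's exact method:

```python
import torch
import torch.nn.functional as F

def neighbor_infonce(queries: torch.Tensor, docs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # queries, docs: (B, D); docs[i] is the positive for queries[i], and the
    # batch is assumed to be a seed document plus its nearest corpus
    # neighbors, so every other row acts as a hard in-batch negative.
    q = F.normalize(queries, dim=-1)
    d = F.normalize(docs, dim=-1)
    logits = q @ d.T / temperature                  # (B, B) similarities
    labels = torch.arange(len(q), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)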
- SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap [10.011953474950744]
We propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering.
SCOB enables weakly supervised learning, significantly reducing annotation costs.
Our findings suggest that SCOB can serve generally and effectively as a read-type pre-training method.
arXiv Detail & Related papers (2023-09-21T15:06:08Z)
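The "online text rendering" part of SCOB can be pictured as synthesizing word images on the fly, which yields character labels for free and is what makes the weak supervision cheap. A toy sketch; the real renderer, fonts, and augmentations are assumptions here:

```python
from PIL import Image, ImageDraw

def render_text(text: str, size=(256, 32)):
    """Render `text` to a grayscale image; because the string is known,
    every character comes with a free label for character-wise
    supervised contrastive learning."""
    img = Image.new("L", size, color=255)           # white canvas
    ImageDraw.Draw(img).text((4, 8), text, fill=0)  # black text, default font
    return img, list(text)                          # image + per-char labels
```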
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
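The "deeper feature sharing" claim amounts to one set of decoder queries feeding all three task branches, so gradients from each task shape the shared features. A schematic sketch; layer sizes and branch shapes are illustrative:

```python
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """One shared query representation drives classification, segmentation,
    and recognition branches that are trained jointly."""
    def __init__(self, d_model: int = 256, num_classes: int = 2,
                 mask_dim: int = 28 * 28, vocab: int = 97):
        super().__init__()
        self.cls = nn.Linear(d_model, num_classes)  # text / not-text
        self.seg = nn.Linear(d_model, mask_dim)     # coarse instance mask
        self.rec = nn.Linear(d_model, vocab)        # per-step character logits

    def forward(self, queries):  # (B, Q, d_model) decoder outputs
        return self.cls(queries), self.seg(queries), self.rec(queries)
```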
- Entry Separation using a Mixed Visual and Textual Language Model: Application to 19th century French Trade Directories [18.323615434182553]
A key challenge is to correctly segment what constitutes the basic text regions for the target database.
We propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories.
By injecting special visual tokens that encode, for instance, indentation or line breaks into the token stream of the language model used for NER, we can leverage both textual and visual knowledge simultaneously.
arXiv Detail & Related papers (2023-02-17T15:30:44Z)
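The token-injection idea above is simple enough to sketch: layout cues detected in the page image become special tokens spliced into the text stream before NER. The token names and the indentation threshold below are made up for illustration:

```python
def inject_layout_tokens(lines):
    """Splice special visual tokens into the token stream so a text-only
    tagger can condition on layout. `lines` is assumed to be a list of
    dicts like {"text": str, "x0": float} with x0 a normalized left margin."""
    tokens = []
    for line in lines:
        if line["x0"] > 0.1:          # indented line: continuation of an entry
            tokens.append("[INDENT]")
        tokens.extend(line["text"].split())
        tokens.append("[BREAK]")      # explicit line-break marker
    return tokens
```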
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
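The core LQAE move is to quantize image features against a frozen language-model embedding table, so an image literally becomes a sequence of text tokens. A stripped-down sketch; the straight-through gradient trick and the decoder are omitted:

```python
import torch

def quantize_to_text_tokens(patch_feats: torch.Tensor,
                            token_embeddings: torch.Tensor):
    # patch_feats: (N, D) image-patch features; token_embeddings: (V, D)
    # frozen table from a pretrained LM (e.g. BERT's word embeddings)
    dists = torch.cdist(patch_feats, token_embeddings)  # (N, V)
    token_ids = dists.argmin(dim=-1)                    # nearest token per patch
    quantized = token_embeddings[token_ids]             # (N, D) codebook vectors
    return token_ids, quantized
```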
- Knowing Where and What: Unified Word Block Pretraining for Document Understanding [11.46378901674016]
We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
Joint training of Masked Layout-Language Modeling (MLLM) and the two newly proposed tasks enables semantic and spatial features to interact in a unified way.
arXiv Detail & Related papers (2022-07-28T09:43:06Z)
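Surrounding Word Prediction can be read as: for each word block, identify its spatial neighbors and train the model to predict them. A sketch of one plausible target construction; the neighbor count k and the distance metric are assumptions:

```python
import torch

def surrounding_targets(centers: torch.Tensor, k: int = 4) -> torch.Tensor:
    """For each word-block center (N, 2), return the indices of its k
    spatially nearest blocks, which serve as the 'surrounding words'
    the SWP task asks the model to predict."""
    dists = torch.cdist(centers, centers)        # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))           # a word is not its own neighbor
    return dists.topk(k, largest=False).indices  # (N, k) neighbor indices
```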
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
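A character-aware text encoder can be as simple as embedding a word as its character sequence, so sub-word structure is visible to the visual-textual alignment. A minimal sketch; the sizes and the GRU choice are illustrative, not the paper's design:

```python
import torch
import torch.nn as nn

class CharAwareTextEncoder(nn.Module):
    """Encodes a word as the sequence of its characters and pools it into a
    single feature, to be aligned with image-encoder features."""
    def __init__(self, n_chars: int = 128, dim: int = 256):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (B, L) character ids for a batch of words
        hidden, _ = self.gru(self.char_embed(char_ids))
        return hidden[:, -1]  # (B, dim) word-level textual feature
```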
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)