LAMBERT: Layout-Aware (Language) Modeling for information extraction
- URL: http://arxiv.org/abs/2002.08087v5
- Date: Fri, 28 May 2021 12:29:14 GMT
- Title: LAMBERT: Layout-Aware (Language) Modeling for information extraction
- Authors: Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski and Filip Graliński
- Abstract summary: We introduce a new approach to the problem of understanding documents where non-trivial layout influences the local semantics.
We modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system.
We show that our model achieves superior performance on datasets consisting of visually rich documents.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a simple new approach to the problem of understanding documents
where non-trivial layout influences the local semantics. To this end, we modify
the Transformer encoder architecture in a way that allows it to use layout
features obtained from an OCR system, without the need to re-learn language
semantics from scratch. We only augment the input of the model with the
coordinates of token bounding boxes, avoiding, in this way, the use of raw
images. This leads to a layout-aware language model which can then be
fine-tuned on downstream tasks.
The model is evaluated on an end-to-end information extraction task using
four publicly available datasets: Kleister NDA, Kleister Charity, SROIE and
CORD. We show that our model achieves superior performance on datasets
consisting of visually rich documents, while also outperforming the baseline
RoBERTa on documents with flat layout (NDA \(F_{1}\) increase from 78.50 to
80.42). Our solution ranked first on the public leaderboard for the Key
Information Extraction from the SROIE dataset, improving the SOTA
\(F_{1}\)-score from 97.81 to 98.17.
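The modification the abstract describes is input-level and can be illustrated in a few lines. The following is a minimal PyTorch sketch of the general idea rather than the authors' implementation: token embeddings are summed with a learned projection of normalized OCR bounding-box coordinates, so pre-trained language semantics are reused instead of re-learned from scratch. The names LayoutAwareEmbedding and layout_proj are hypothetical, and the near-zero initialization of the layout projection is one plausible way to leave the pre-trained model's behavior intact at the start of fine-tuning.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Token embeddings augmented with bounding-box coordinates (a sketch;
    names and initialization are assumptions, not the paper's exact design)."""

    def __init__(self, vocab_size: int, hidden_size: int, bbox_dim: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # Maps (x0, y0, x1, y1), normalized to [0, 1], into the hidden space.
        self.layout_proj = nn.Linear(bbox_dim, hidden_size)
        # Start from zero so the model initially behaves like the plain
        # pre-trained language model and learns to use layout gradually.
        nn.init.zeros_(self.layout_proj.weight)
        nn.init.zeros_(self.layout_proj.bias)

    def forward(self, input_ids: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len); bboxes: (batch, seq_len, 4)
        return self.token_emb(input_ids) + self.layout_proj(bboxes)
```

In a fine-tuning setup, a module like this would replace the word-embedding layer of a pre-trained encoder such as RoBERTa, with the rest of the Transformer stack left unchanged.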
Related papers
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (arXiv, 2024-03-07)
  We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. By adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions. By expanding its capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability.
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling (arXiv, 2023-08-15)
  We propose GraphLM, a novel document understanding model that injects layout knowledge into the model. We evaluate GraphLM on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
- GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest (arXiv, 2023-07-07)
  We propose spatial instruction tuning, which introduces references to regions of interest (RoIs) in the instruction. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience.
- Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation (arXiv, 2023-05-24)
  We introduce the topological adjacency relationship among tokens, emphasizing their relative position information. We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings. Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
- DUBLIN -- Document Understanding By Language-Image Network (arXiv, 2023-05-23)
  We propose DUBLIN, which is pretrained on web pages using three novel objectives. We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and an F1 of 84.25 on the WebSRC dataset. We also achieve competitive performance on RVL-CDIP document classification.
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model (arXiv, 2022-11-21)
  We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms. We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance. Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (arXiv, 2022-10-12)
  We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement. We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting a new state of the art on key information extraction and document question answering.
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching (arXiv, 2021-09-26)
  Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task require large numbers of labeled samples and learn separate models for different types of documents. We propose a deep, end-to-end trainable network for one-shot KIE using partial graph matching.
- Spatial Dual-Modality Graph Reasoning for Key Information Extraction (arXiv, 2021-03-26)
  We propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We release a new dataset named WildReceipt, collected and annotated for the evaluation of key information extraction from document images of unseen templates in the wild.
- Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models (arXiv, 2020-05-22)
  We study the problem of information extraction from visually rich documents (VRDs). We present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents.
- Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling (arXiv, 2020-03-29)
  We train a Transformer-based neural model conditioned on the BERT language model. In addition, we propose BERT-windowing, a method that allows chunk-wise processing of texts longer than the BERT window size; a sketch of this idea follows the list. The results of our models are compared to baseline and state-of-the-art models on the CNN/Daily Mail dataset.
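The BERT-windowing technique mentioned in the last entry can be made concrete. Below is a minimal sketch under stated assumptions, not the authors' implementation: a sequence longer than the encoder's window is split into overlapping chunks, each chunk is encoded independently, and per-token outputs are averaged wherever chunks overlap. The window_encode function, the encode callable, and the window/stride values of 512/256 are illustrative placeholders; the paper's actual merging rule may differ.

```python
from typing import Callable, List

Vector = List[float]

def window_encode(
    tokens: List[int],
    encode: Callable[[List[int]], List[Vector]],  # one output vector per input token
    window: int = 512,
    stride: int = 256,
) -> List[Vector]:
    """Encode a long token sequence with a fixed-window encoder by sliding
    an overlapping window and averaging overlapping per-token outputs."""
    assert 0 < stride <= window
    sums: List[Vector] = [[] for _ in tokens]  # running per-token sums
    counts = [0] * len(tokens)                 # how many windows covered each token
    start = 0
    while True:
        vecs = encode(tokens[start:start + window])
        for offset, vec in enumerate(vecs):
            pos = start + offset
            if counts[pos] == 0:
                sums[pos] = list(vec)
            else:
                sums[pos] = [s + v for s, v in zip(sums[pos], vec)]
            counts[pos] += 1
        if start + window >= len(tokens):
            break
        start += stride
    # Average the positions that were covered by more than one window.
    return [[s / c for s in vec] for vec, c in zip(sums, counts)]
```

With a real model, encode would wrap a single fixed-length encoder pass (for example, running BERT on one chunk and returning its hidden states); averaging overlaps is one simple stitching choice among several.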