Incorporating Visual Layout Structures for Scientific Text
Classification
- URL: http://arxiv.org/abs/2106.00676v1
- Date: Tue, 1 Jun 2021 17:59:00 GMT
- Title: Incorporating Visual Layout Structures for Scientific Text
Classification
- Authors: Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld,
Doug Downey
- Abstract summary: We introduce new methods for incorporating VIsual LAyout structures (VILA), e.g., the grouping of page texts into text lines or text blocks, into language models.
We show that the I-VILA approach, which simply adds special tokens denoting boundaries between layout structures into model inputs, can lead to +14.5 F1 Score improvements in token classification tasks.
- Score: 31.15058113053433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifying the core textual components of a scientific paper-title, author,
body text, etc.-is a critical first step in automated scientific document
understanding. Previous work has shown how using elementary layout information,
i.e., each token's 2D position on the page, leads to more accurate
classification. We introduce new methods for incorporating VIsual LAyout
structures (VILA), e.g., the grouping of page texts into text lines or text
blocks, into language models to further improve performance. We show that the
I-VILA approach, which simply adds special tokens denoting boundaries between
layout structures into model inputs, can lead to +1~4.5 F1 Score improvements
in token classification tasks. Moreover, we design a hierarchical model H-VILA
that encodes these layout structures and record a up-to 70% efficiency boost
without hurting prediction accuracy. The experiments are conducted on a newly
curated evaluation suite, S2-VLUE, with a novel metric measuring VILA awareness
and a new dataset covering 19 scientific disciplines with gold annotations.
Pre-trained weights, benchmark datasets, and source code will be available at
https://github.com/allenai/VILA}{https://github.com/allenai/VILA.
Related papers
- Leveraging Structure Knowledge and Deep Models for the Detection of Abnormal Handwritten Text [19.05500901000957]
We propose a two-stage detection algorithm that combines structure knowledge and deep models for handwritten text.
A shape regression network trained by a novel semi-supervised contrast training strategy is introduced and the positional relationship between the characters is fully employed.
Experiments on two handwritten text datasets show that the proposed method can greatly improve the detection performance.
arXiv Detail & Related papers (2024-10-15T14:57:10Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Enhancing Visually-Rich Document Understanding via Layout Structure
Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings [20.25180279903009]
We propose Contrastive Graph-Text pretraining (ConGraT) for jointly learning separate representations of texts and nodes in a text-attributed graph (TAG)
Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP.
Experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling.
arXiv Detail & Related papers (2023-05-23T17:53:30Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z) - Entry Separation using a Mixed Visual and Textual Language Model:
Application to 19th century French Trade Directories [18.323615434182553]
A key challenge is to correctly segment what constitutes the basic text regions for the target database.
We propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories.
By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER purpose, we can leverage both textual and visual knowledge simultaneously.
arXiv Detail & Related papers (2023-02-17T15:30:44Z) - ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich
Document Understanding [52.3895498789521]
We propose ERNIE, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, and learn the proper reading order of documents.
Experimental results show ERNIE achieves superior performance on various downstream tasks, setting new state-of-the-art on key information, and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z) - Stage-wise Fine-tuning for Graph-to-Text Generation [25.379346921398326]
Graph-to-text generation has benefited from pre-trained language models (PLMs) in achieving better performance than structured graph encoders.
We propose a structured graph-to-text model with a two-step fine-tuning mechanism which first fine-tunes model on Wikipedia before adapting to the graph-to-text generation.
arXiv Detail & Related papers (2021-05-17T17:15:29Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Structure-Augmented Text Representation Learning for Efficient Knowledge
Graph Completion [53.31911669146451]
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks.
These graphs are usually incomplete, urging auto-completion of them.
graph embedding approaches, e.g., TransE, learn structured knowledge via representing graph elements into dense embeddings.
textual encoding approaches, e.g., KG-BERT, resort to graph triple's text and triple-level contextualized representations.
arXiv Detail & Related papers (2020-04-30T13:50:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.