Self-Supervised Pre-Training for Table Structure Recognition Transformer
- URL: http://arxiv.org/abs/2402.15578v1
- Date: Fri, 23 Feb 2024 19:34:06 GMT
- Title: Self-Supervised Pre-Training for Table Structure Recognition Transformer
- Authors: ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari
Balasubramaniyan and Duen Horng Chau
- Abstract summary: We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
- Score: 25.04573593082671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Table structure recognition (TSR) aims to convert tabular images into a
machine-readable format. Although the hybrid convolutional neural network
(CNN)-transformer architecture is widely used in existing approaches, the linear
projection transformer has outperformed the hybrid architecture in numerous
vision tasks due to its simplicity and efficiency. However, existing research
has demonstrated that directly replacing the CNN backbone with a linear
projection leads to a marked performance drop. In this work, we resolve the
issue by proposing a self-supervised pre-training (SSP) method for TSR
transformers. We discover that the performance gap between the linear
projection transformer and the hybrid CNN-transformer can be mitigated by SSP
of the visual encoder in the TSR model. We conducted reproducible ablation
studies and open-sourced our code at https://github.com/poloclub/unitable to
enhance transparency, inspire innovation, and facilitate fair comparisons in
our domain, as tables are a promising modality for representation learning.
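To make the architectural contrast above concrete, the following is a minimal sketch, assuming a ViT-style visual encoder whose only image-to-token step is a single linear patch projection, plus a masked-patch reconstruction head as one plausible form of SSP for that encoder. The abstract does not specify the SSP objective, patch size, or model dimensions, so those choices and the class names (LinearProjectionEncoder, MaskedPatchPretrainer) are illustrative assumptions rather than the authors' implementation; the actual code is released at https://github.com/poloclub/unitable.

```python
# Minimal sketch (not the authors' implementation): a linear-projection visual
# encoder for table images, plus a masked-patch reconstruction objective as one
# plausible form of self-supervised pre-training (SSP). Patch size, depth, and
# dimensions are assumptions; see https://github.com/poloclub/unitable.
import torch
import torch.nn as nn


class LinearProjectionEncoder(nn.Module):
    """Visual encoder that replaces a CNN backbone with a single linear
    projection of non-overlapping image patches (ViT-style)."""

    def __init__(self, img_size=448, patch_size=16, d_model=512, depth=4, nhead=8):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # One linear map from each flattened RGB patch to a d_model token.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                                  # (B, 3, H, W)
        tokens = self.proj(images).flatten(2).transpose(1, 2)   # (B, N, d_model)
        return self.encoder(tokens + self.pos_embed)            # (B, N, d_model)


class MaskedPatchPretrainer(nn.Module):
    """Hypothetical SSP head: mask a fraction of patch tokens and regress the
    raw pixels of the masked patches (masked-image-modeling style)."""

    def __init__(self, encoder, patch_size=16, d_model=512, mask_ratio=0.5):
        super().__init__()
        self.encoder, self.mask_ratio = encoder, mask_ratio
        self.patch_size = patch_size
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, 3 * patch_size * patch_size)

    def forward(self, images):
        B, _, H, W = images.shape
        p = self.patch_size
        # Ground-truth pixel patches, flattened to (B, N, 3*p*p).
        target = images.unfold(2, p, p).unfold(3, p, p)
        target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * p * p)

        tokens = self.encoder.proj(images).flatten(2).transpose(1, 2)
        mask = torch.rand(B, tokens.size(1), device=images.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        feats = self.encoder.encoder(tokens + self.encoder.pos_embed)

        pred = self.head(feats)
        loss = ((pred - target) ** 2)[mask].mean()   # loss only on masked patches
        return loss
```

In a full TSR pipeline, the pre-trained encoder would then be paired with a textual decoder that emits table-structure tokens, with the visual encoder weights carried over from the SSP stage.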
Related papers
- Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks.
However, they exhibit higher computational complexity and inference latency than convolutional neural networks.
We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet).
arXiv Detail & Related papers (2024-09-26T07:24:09Z)
- Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges.
CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias.
Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs.
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
- High-Performance Transformers for Table Structure Recognition Need Early Convolutions [25.04573593082671]
Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder.
We design a lightweight visual encoder for table structure recognition (TSR) without sacrificing expressive power.
We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model (see the sketch of such a stem after this list).
arXiv Detail & Related papers (2023-11-09T18:20:52Z)
- B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers [97.75725574963197]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
We show that a sequence of such transformations induces a single linear transformation that faithfully summarises the full model computations.
We show that the resulting explanations are of high visual quality and perform well under quantitative interpretability metrics.
arXiv Detail & Related papers (2023-06-19T12:54:28Z)
- NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning [25.197394237526865]
We propose NAR-Former V2, a modified Transformer-based model for universal neural network representation learning.
Specifically, we treat the network as a graph and design a straightforward tokenizer to encode the network into a sequence.
We incorporate the inductive representation learning capability of GNNs into the Transformer, enabling it to generalize better when encountering unseen architectures.
arXiv Detail & Related papers (2023-06-19T09:11:04Z)
- B-cos Networks: Alignment is All We Need for Interpretability [136.27303006772294]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
A B-cos transform induces a single linear transform that faithfully summarises the full model computations.
We show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets.
arXiv Detail & Related papers (2022-05-20T16:03:29Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Transformer Assisted Convolutional Network for Cell Instance Segmentation [5.195101477698897]
We present a transformer based approach to enhance the performance of the conventional convolutional feature extractor.
Our approach merges the convolutional feature maps with transformer-based token embeddings by applying a projection operation similar to self-attention in transformers.
arXiv Detail & Related papers (2021-10-05T18:18:31Z)
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
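The early-convolutions entry above reports that a convolutional stem can match a classic CNN backbone for TSR with a much simpler model. Below is a minimal sketch of such a stem, assuming the common recipe of a few stride-2 3x3 convolutions followed by a 1x1 projection; the widths, depth, and normalization are illustrative assumptions, not that paper's exact configuration. Because the stem produces the same 1/16-resolution token grid as a 16x16 linear patch projection, it is a drop-in alternative in front of the same transformer encoder, which is precisely the design axis behind the performance gap discussed in the main abstract.

```python
# Minimal sketch (assumptions, not the cited paper's exact design): a
# convolutional stem that yields the same 1/16-resolution token grid as a
# 16x16 linear patch projection, so it can feed the same transformer encoder.
import torch
import torch.nn as nn


class ConvStemEncoder(nn.Module):
    """Early-convolution stem: a few stride-2 3x3 convs followed by a 1x1
    projection to the model dimension, instead of one 16x16 linear patchify."""

    def __init__(self, d_model=512, widths=(64, 128, 256, 512)):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in widths:  # four stride-2 convs: 16x total downsampling
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        layers.append(nn.Conv2d(in_ch, d_model, kernel_size=1))  # project to d_model
        self.stem = nn.Sequential(*layers)

    def forward(self, images):                      # (B, 3, H, W)
        feats = self.stem(images)                   # (B, d_model, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)     # (B, N, d_model) token sequence
```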