PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX
- URL: http://arxiv.org/abs/2105.01846v1
- Date: Wed, 5 May 2021 03:15:48 GMT
- Title: PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX
- Authors: Yelin He and Xianbiao Qi and Jiaquan Ye and Peng Gao and Yihao Chen and Bingcong Li and Xin Tang and Rong Xiao
- Abstract summary: The ICDAR 2021 Competition has two sub-tasks: Table Structure Reconstruction (TSR) and Table Content Reconstruction (TCR).
We leverage our previously proposed algorithm MASTER (Lu et al., 2019), originally designed for scene text recognition.
Our method achieves 0.7444 Exact Match and 0.8765 Exact Match @95% on the TSR task, and obtains 0.5586 Exact Match and 0.7386 Exact Match @95% on the TCR task.
- Score: 16.003357804292513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents our solution for the ICDAR 2021 Competition on Scientific
Table Image Recognition to LaTeX. This competition has two sub-tasks: Table
Structure Reconstruction (TSR) and Table Content Reconstruction (TCR). We treat
both sub-tasks as two individual image-to-sequence recognition problems. We
leverage our previously proposed algorithm MASTER \cite{lu2019master}, originally
designed for scene text recognition. We optimize the MASTER model
from several perspectives: network structure, optimizer, normalization method,
pre-trained model, resolution of input image, data augmentation, and model
ensemble. Our method achieves 0.7444 Exact Match and 0.8765 Exact Match @95\%
on the TSR task, and obtains 0.5586 Exact Match and 0.7386 Exact Match @95\% on
the TCR task.
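
As a rough illustration of the two evaluation metrics, the sketch below scores predicted LaTeX token sequences against ground truth. It is a minimal Python sketch under stated assumptions: Exact Match requires the predicted sequence to equal the ground truth token for token, and Exact Match @95% accepts a prediction whose token-level similarity to the ground truth is at least 95%. The tokenization and the similarity measure (difflib's SequenceMatcher ratio) are illustrative assumptions, not the competition's official scorer.

```python
from difflib import SequenceMatcher

def exact_match(pred, gt):
    # Strict Exact Match: every token must agree, in order.
    return pred == gt

def match_at(pred, gt, threshold=0.95):
    # Relaxed match: token-level similarity must reach the threshold.
    # SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    # matched tokens and T is the combined length of both sequences.
    return SequenceMatcher(None, pred, gt).ratio() >= threshold

def score(preds, gts):
    # Corpus-level Exact Match and Exact Match @95%.
    n = len(gts)
    em = sum(exact_match(p, g) for p, g in zip(preds, gts)) / n
    em95 = sum(match_at(p, g) for p, g in zip(preds, gts)) / n
    return em, em95

# Example: one perfect prediction and one near miss (a dropped row break).
gts = [[r"\begin{tabular}", "&", r"\\", r"\end{tabular}"],
       [r"\begin{tabular}", "&", r"\\", r"\end{tabular}"]]
preds = [[r"\begin{tabular}", "&", r"\\", r"\end{tabular}"],
         [r"\begin{tabular}", "&", r"\end{tabular}"]]
print(score(preds, gts))  # (0.5, 0.5): the near miss is only ~86% similar
```

Under these assumptions, a prediction that differs by even one structural token fails Exact Match, but can still pass the @95% threshold once the sequences are long enough.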
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality and task discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z) - Automatic Creative Selection with Cross-Modal Matching [0.4215938932388723]
We present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model.
We evaluate our approach using two sets of labels: advertiser associated (image, search term) pairs for a given application, and human ratings for the relevance between (image, search term) pairs.
arXiv Detail & Related papers (2024-02-28T22:05:38Z) - StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners [58.941838860425754]
We show that self-supervised methods trained on synthetic images can match or beat their real-image counterparts.
We develop a multi-positive contrastive learning method, which we call StableRep.
With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP.
arXiv Detail & Related papers (2023-06-01T17:59:51Z) - Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation, and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z) - BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z) - CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes the learned representations to be geometrically consistent in the image and text space.
arXiv Detail & Related papers (2022-05-28T15:31:17Z) - ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX [1.149654395906819]
This paper discusses the dataset, tasks, participants' methods, and results of the ICDAR 2021 Competition on Scientific Table Image Recognition.
We propose two subtasks: reconstruct the structure code from an image, and reconstruct the content code from an image.
This report describes the datasets and ground truth specification, details the performance evaluation metrics used, presents the final results, and summarizes the participating methods.
arXiv Detail & Related papers (2021-05-30T04:17:55Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout, and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z) - Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy by expanding the views generated by a single image to Cross-samples and Multi-level representation.
Our method, termed CsMl, can integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z) - SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning [79.30956389694184]
Iterative Language-Based Image Editing (ILBIE) tasks follow iterative instructions to edit images step by step.
Data scarcity is a significant issue for ILBIE, as it is challenging to collect large-scale examples of images before and after instruction-based changes.
We introduce a Self-Supervised Counterfactual Reasoning framework that incorporates counterfactual thinking to overcome data scarcity.
arXiv Detail & Related papers (2020-09-21T01:45:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.