MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition
- URL: http://arxiv.org/abs/2303.13931v1
- Date: Fri, 24 Mar 2023 11:40:50 GMT
- Title: MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition
- Authors: Marwa Dhiaf, Ahmed Cheikh Rouhou, Yousri Kessentini, Sinda Ben Salem
- Abstract summary: We propose a lite transformer architecture for full-page multi-script handwriting recognition.
The proposed model comes with three advantages.
It can learn the reading order at page-level thanks to a curriculum learning strategy.
It can be easily adapted to other scripts by applying a simple transfer-learning process.
- Score: 3.0682439731292592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer has quickly become the dominant architecture for various
pattern recognition tasks due to its capacity for long-range representation.
However, transformers are data-hungry models and need large datasets for
training. In Handwritten Text Recognition (HTR), collecting a massive amount of
labeled data is a complicated and expensive task. In this paper, we propose a
lite transformer architecture for full-page multi-script handwriting
recognition. The proposed model comes with three advantages: First, to solve
the common problem of data scarcity, we propose a lite transformer model that
can be trained on a reasonable amount of data, which is the case for most
public HTR datasets, without the need for external data. Second, it can learn the
reading order at page-level thanks to a curriculum learning strategy, allowing
it to avoid line segmentation errors, exploit a larger context and reduce the
need for costly segmentation annotations. Third, it can be easily adapted to
other scripts by applying a simple transfer-learning process using only
page-level labeled images. Extensive experiments on different datasets with
different scripts (French, English, Spanish, and Arabic) show the effectiveness
of the proposed model.
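To make the curriculum-learning and transfer-learning ideas above concrete, here is a minimal PyTorch sketch of how such a training pipeline could be wired together. It is not the authors' implementation: the model, the class and function names (LiteHTRTransformer, curriculum_train, adapt_to_new_script), the CNN stem, and all hyper-parameters are hypothetical stand-ins for illustration.

```python
# Illustrative sketch only: a small image-to-sequence transformer trained with a
# line -> paragraph -> page curriculum, then adapted to a new script by reusing
# the visual encoder. Names and hyper-parameters are assumptions, not the paper's.
import torch
import torch.nn as nn

PAD = 0  # assumed padding token id

class LiteHTRTransformer(nn.Module):
    """A deliberately small encoder-decoder (the 'lite' assumption)."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        # CNN stem turns the grayscale page image into a sequence of visual tokens.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=2, stride=2),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead, num_encoder_layers=num_layers,
            num_decoder_layers=num_layers, batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, targets):
        feats = self.backbone(images)                   # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)       # (B, H'*W', C)
        tgt = self.embed(targets)
        causal = self.transformer.generate_square_subsequent_mask(targets.size(1))
        out = self.transformer(memory, tgt, tgt_mask=causal)
        return self.head(out)                           # (B, T, vocab)

def train_stage(model, loader, epochs=1, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
    for _ in range(epochs):
        for images, targets in loader:
            logits = model(images, targets[:, :-1])     # teacher forcing
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           targets[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

def curriculum_train(model, line_loader, paragraph_loader, page_loader):
    # Curriculum: the same model first learns characters on single lines, then
    # the reading order of paragraphs, and finally of full pages.
    for loader in (line_loader, paragraph_loader, page_loader):
        train_stage(model, loader)

def adapt_to_new_script(pretrained, new_vocab_size, page_loader):
    # Transfer learning: reuse the visual encoder and transformer weights,
    # re-initialise the text embedding/output layers for the new script, and
    # fine-tune on page-level labels only.
    model = LiteHTRTransformer(new_vocab_size)
    model.backbone.load_state_dict(pretrained.backbone.state_dict())
    model.transformer.load_state_dict(pretrained.transformer.state_dict())
    train_stage(model, page_loader)
    return model
```

The staging mirrors the abstract's claims: the model sees progressively larger crops so it can learn page-level reading order without line segmentation, and adapting to a new script only needs page-level labels because the visual part is reused.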
Related papers
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset, named HowToStep, with supervision from multiple descriptive steps.
arXiv Detail & Related papers (2023-12-21T17:28:09Z) - Efficient Pre-training for Localized Instruction Generation of Videos [32.13509517228516]
Procedural videos are instrumental in conveying step-by-step instructions.
Process Transformer (ProcX) is a model for end-to-end step localization and instruction generation for procedural videos.
arXiv Detail & Related papers (2023-11-27T16:07:37Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Mask-guided BERT for Few Shot Text Classification [12.361032727044547]
Mask-BERT is a simple and modular framework to help BERT-based architectures tackle few-shot learning.
The core idea is to selectively apply masks on text inputs and filter out irrelevant information, which guides the model to focus on discriminative tokens (sketched below).
Experimental results on public-domain benchmark datasets demonstrate the effectiveness of Mask-BERT.
arXiv Detail & Related papers (2023-02-21T05:24:00Z) - Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
- Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers (see the generic adapter sketch below).
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z) - All Birds with One Stone: Multi-task Text Classification for Efficient
Inference with One Forward Pass [34.85886030306857]
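The Cross-Modal Adapter itself is only described at a high level above; the sketch below shows the generic bottleneck-adapter pattern that family of methods relies on, i.e., freezing the pre-trained weights and training a few small inserted layers. Module names and sizes are illustrative, not the paper's.

```python
# Generic bottleneck adapter for parameter-efficient fine-tuning: a small
# down-/up-projection with a residual connection, trained while the backbone
# stays frozen. This is the pattern the paper builds on, not its exact module.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))      # residual keeps backbone features

def freeze_backbone_train_adapters(backbone: nn.Module, adapters: nn.ModuleList):
    # Only the adapter weights receive gradients during fine-tuning.
    for p in backbone.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True
```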
- All Birds with One Stone: Multi-task Text Classification for Efficient Inference with One Forward Pass [34.85886030306857]
In web content classification, multiple classification tasks are predicted from the same input text, such as a web article.
Existing multitask transformer models need to conduct N forward passes for N tasks with O(N) cost.
We propose a scalable method that can achieve stronger performance with close to O(1) computation cost via only one forward pass, as illustrated below.
arXiv Detail & Related papers (2022-05-22T05:16:03Z) - Pushing the Limits of Simple Pipelines for Few-Shot Learning: External
Data and Fine-Tuning Make a Difference [74.80730361332711]
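The contrast above between N forward passes (O(N) cost) and a single shared pass can be pictured with a minimal shared-encoder sketch: encode the text once and apply N lightweight task heads to the same representation. This is a generic illustration, not the paper's architecture; all names are hypothetical.

```python
# Minimal multi-task classifier with one forward pass: the encoder runs once per
# input and each task only adds a cheap linear head, so encoder cost is O(1) in
# the number of tasks. Illustrative sketch only.
import torch
import torch.nn as nn

class OnePassMultiTaskClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int, labels_per_task):
        super().__init__()
        self.encoder = encoder                          # any encoder returning (B, T, d_model)
        self.heads = nn.ModuleList(nn.Linear(d_model, n) for n in labels_per_task)

    def forward(self, token_embeddings):
        hidden = self.encoder(token_embeddings)         # single shared pass
        pooled = hidden.mean(dim=1)                     # (B, d_model)
        return [head(pooled) for head in self.heads]    # one cheap head per task

# Toy usage: three tasks, one encoder pass over a batch of 8 sequences.
layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
model = OnePassMultiTaskClassifier(nn.TransformerEncoder(layer, num_layers=2),
                                   d_model=128, labels_per_task=[2, 5, 3])
logits_per_task = model(torch.randn(8, 32, 128))
```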
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z) - Revisiting Transformer-based Models for Long Document Classification [31.60414185940218]
In real-world applications, multi-page multi-paragraph documents are common and cannot be efficiently encoded by vanilla Transformer-based models.
We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers.
We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks.
arXiv Detail & Related papers (2022-04-14T00:44:36Z) - DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text
Generation in E-commerce Title and Review Summarization [14.414693156937782]
We propose a novel domain-specific generative pre-training (DS-GPT) method for text generation.
We apply it to the product title and review summarization problems on E-commerce mobile display.
arXiv Detail & Related papers (2021-12-15T19:02:49Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text
Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.