Transferring General Multimodal Pretrained Models to Text Recognition
- URL: http://arxiv.org/abs/2212.09297v1
- Date: Mon, 19 Dec 2022 08:30:42 GMT
- Title: Transferring General Multimodal Pretrained Models to Text Recognition
- Authors: Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An
Yang, Chang Zhou
- Abstract summary: We recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task.
We construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.
- Score: 46.33867696799362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained
models to text recognition. Specifically, we recast text recognition as image
captioning and directly transfer a unified vision-language pretrained model to
the end task. Without pretraining on large-scale annotated or synthetic text
recognition data, OFA-OCR outperforms the baselines and achieves
state-of-the-art performance in the Chinese text recognition benchmark.
Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate
that it can achieve competitive performance with the product-level API. The
code (https://github.com/OFA-Sys/OFA) and demo
(https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly
available.
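The core idea, recasting text recognition as image captioning, can be illustrated with a toy greedy decoder: the model autoregressively emits the transcription as if it were a caption for the image. Everything below (the `step` callback, `toy_step`, the BOS/EOS markers) is an illustrative stand-in, not the actual OFA API.

```python
# Sketch of "text recognition as image captioning": a generic
# encoder-decoder captioner decodes the OCR transcription token by token.

BOS, EOS = "<s>", "</s>"

def greedy_caption(image_features, step, max_len=16):
    """Greedy autoregressive decoding: feed the image features and the
    tokens generated so far; take the most likely next token each step."""
    tokens = [BOS]
    for _ in range(max_len):
        next_token = step(image_features, tokens)  # stands in for an argmax over the vocab
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop BOS

# Toy "model": pretends the image contains the text "OCR demo".
def toy_step(image_features, tokens):
    target = ["OCR", "demo"]
    produced = len(tokens) - 1
    return target[produced] if produced < len(target) else EOS

print(greedy_caption(None, toy_step))  # -> ['OCR', 'demo']
```

In the actual method, `step` would be a forward pass of the unified vision-language model (OFA) fine-tuned on text-recognition data, with no OCR-specific pretraining stage.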
Related papers
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework that exploits the knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Fuzzy Fingerprinting Transformer Language-Models for Emotion Recognition in Conversations [0.7874708385247353]
We propose to combine the two approaches to perform Emotion Recognition in Conversations (ERC).
We feed utterances and their previous conversational turns to a pre-trained RoBERTa, obtaining contextual embedding utterance representations.
We validate our approach on the widely used DailyDialog ERC benchmark dataset.
arXiv Detail & Related papers (2023-09-08T12:26:01Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods for the recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model [18.26945997660616]
Many downstream tasks and human readers rely on the output of the ASR system.
We propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into a readable text.
arXiv Detail & Related papers (2021-02-22T15:45:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.