Out-of-Vocabulary Challenge Report
- URL: http://arxiv.org/abs/2209.06717v1
- Date: Wed, 14 Sep 2022 15:25:54 GMT
- Title: Out-of-Vocabulary Challenge Report
- Authors: Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel,
Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas
- Abstract summary: The Out-Of-Vocabulary 2022 (OOV) challenge introduces the recognition of scene text instances that are unseen at training time.
The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances.
A thorough analysis of results from baselines and different participants is presented.
- Score: 15.827931962904115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents final results of the Out-Of-Vocabulary 2022 (OOV)
challenge. The OOV contest introduces an important aspect that is not commonly
studied by Optical Character Recognition (OCR) models, namely, the recognition
of scene text instances that are unseen at training time. The competition compiles a
collection of public scene text datasets comprising 326,385 images with
4,864,405 scene text instances, thus covering a wide range of data
distributions. A new and independent validation and test set is formed with
scene text instances that are out of vocabulary at training time. The
competition was structured around two tasks: end-to-end and cropped scene text
recognition. A thorough analysis of results from baselines and
different participants is presented. Interestingly, current state-of-the-art
models show a significant performance gap under the newly studied setting. We
conclude that the OOV dataset proposed in this challenge will be an essential
resource for developing scene text models that achieve more robust and
generalized predictions.
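The core idea of an out-of-vocabulary split can be illustrated with a short, hypothetical Python sketch: collect the vocabulary of all training transcriptions, then keep only candidate words that never appear in it. The file names and case-insensitive matching rule below are assumptions for illustration, not the official challenge protocol.

```python
# Hypothetical sketch of an out-of-vocabulary split (not the official
# challenge tooling). File names and normalization are illustrative only.

def load_words(path):
    """Read one scene-text transcription per line (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

train_words = load_words("train_transcriptions.txt")          # assumed path
candidate_words = load_words("candidate_transcriptions.txt")  # assumed path

# Vocabulary seen at training time (case-insensitive matching is an assumption).
train_vocab = {w.lower() for w in train_words}

# Out-of-vocabulary instances: words never seen during training.
oov_words = [w for w in candidate_words if w.lower() not in train_vocab]

print(f"{len(oov_words)} of {len(candidate_words)} candidate words are out of vocabulary")
```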
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
- Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing [4.057550183467041]
The field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models.
We present a comprehensive, multi-perspective analysis of recent advancements in this field.
arXiv Detail & Related papers (2024-02-05T15:13:20Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
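To make the concatenation idea from the COSA entry above concrete, here is a minimal, hypothetical Python sketch that groups independent image-text pairs into pseudo video-paragraph samples; the group size, shuffling, and placeholder data are assumptions rather than details from the paper.

```python
import random

# Illustrative sketch of concatenating image-text pairs into pseudo
# video-paragraph samples (an assumption-laden rendering of the idea,
# not the authors' implementation).

def concatenate_samples(image_text_pairs, group_size=4, seed=0):
    """Group image-text pairs into (frame list, paragraph caption) samples."""
    rng = random.Random(seed)
    pairs = list(image_text_pairs)
    rng.shuffle(pairs)
    samples = []
    for i in range(0, len(pairs) - group_size + 1, group_size):
        group = pairs[i:i + group_size]
        frames = [image for image, _ in group]           # pseudo video frames
        paragraph = " ".join(text for _, text in group)  # pseudo paragraph caption
        samples.append((frames, paragraph))
    return samples

# Placeholder data: file names stand in for actual images.
pairs = [(f"img_{i}.jpg", f"caption {i}") for i in range(8)]
for frames, paragraph in concatenate_samples(pairs):
    print(frames, "->", paragraph)
```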
- ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images [198.35937007558078]
The competition opened on 30th December, 2022 and closed on 24th March, 2023.
There are 35 participants and 91 valid submissions received for Track 1, and 15 participants and 26 valid submissions received for Track 2.
According to the performance of the submissions, we believe there is still a large gap between current and expected information extraction performance in complex and zero-shot scenarios.
arXiv Detail & Related papers (2023-06-05T22:20:52Z)
- Few-shot Domain-Adaptive Visually-fused Event Detection from Text [13.189886554546929]
We present a novel domain-adaptive visually-fused event detection approach that can be trained on a few labelled image-text paired data points.
Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context.
Our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting.
arXiv Detail & Related papers (2023-05-04T00:10:57Z)
- 1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding: Cropped Word Recognition [35.2137931915091]
This report presents our winning solution to the ECCV 2022 challenge on Out-of-Vocabulary Scene Text Understanding (OOV-ST).
Our solution achieves an overall word accuracy of 69.73% when considering both in-vocabulary and out-of-vocabulary words.
arXiv Detail & Related papers (2022-08-04T16:20:58Z)
- Text Detection & Recognition in the Wild for Robot Localization [1.52292571922932]
We propose an end-to-end scene text spotting model that simultaneously outputs the text string and bounding boxes.
Our central contribution is the use of an end-to-end scene text spotting framework to adequately capture irregular and occluded text regions.
arXiv Detail & Related papers (2022-05-17T18:16:34Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)