1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene
Text Understanding: Cropped Word Recognition
- URL: http://arxiv.org/abs/2208.02747v1
- Date: Thu, 4 Aug 2022 16:20:58 GMT
- Title: 1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene
Text Understanding: Cropped Word Recognition
- Authors: Zhangzi Zhu, Yu Hao, Wenqing Zhang, Chuhui Xue, Song Bai
- Abstract summary: This report presents our winning solution to the ECCV 2022 challenge on Out-of-Vocabulary Scene Text Understanding (OOV-ST).
Our solution achieves an overall word accuracy of 69.73% when considering both in-vocabulary and out-of-vocabulary words.
- Score: 35.2137931915091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents our winning solution to the ECCV 2022 challenge on
Out-of-Vocabulary Scene Text Understanding (OOV-ST): Cropped Word Recognition.
The challenge is held in the context of the ECCV 2022 workshop on Text in
Everything (TiE) and aims to extract out-of-vocabulary words from natural
scene images. In the competition, we first pre-train SCATTER on synthetic
datasets and then fine-tune the model on the training set with data
augmentations. In addition, two models are trained specifically for long and
vertical texts. Finally, we combine the outputs of models with different
numbers of layers, different backbones, and different seeds to produce the
final results. Our solution achieves an overall word accuracy of 69.73% when
considering both in-vocabulary and out-of-vocabulary words.
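
As a rough illustration of the final fusion step, the sketch below (a simplified assumption, not the authors' released code) combines the word predictions of several recognizers, e.g. fine-tuned SCATTER variants with different numbers of layers, backbones, and seeds, plus the long-text and vertical-text specialists, by keeping the most confident output per cropped image; the `Prediction` and `Recognizer` names are hypothetical.

```python
# Minimal sketch of confidence-based prediction fusion (an assumption for
# illustration, not the authors' actual ensembling code).
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Prediction:
    text: str          # recognized word
    confidence: float  # word-level score in [0, 1]

# A recognizer is any callable mapping a cropped word image to a Prediction.
Recognizer = Callable[[Any], Prediction]

def ensemble_word(image: Any, recognizers: List[Recognizer]) -> str:
    """Run every model on the cropped word and keep the most confident output."""
    predictions = [recognize(image) for recognize in recognizers]
    best = max(predictions, key=lambda p: p.confidence)
    return best.text

if __name__ == "__main__":
    # Two dummy models standing in for real checkpoints.
    model_a = lambda img: Prediction("street", 0.91)
    model_b = lambda img: Prediction("streat", 0.62)
    print(ensemble_word(object(), [model_a, model_b]))  # -> "street"
```

In practice the fusion could also weight models by validation accuracy or vote at the character level; the abstract only states that outputs from the different models are combined.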
Related papers
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - Out-of-Vocabulary Challenge Report [15.827931962904115]
The Out-Of-Vocabulary 2022 (OOV) challenge introduces the recognition of scene text instances unseen at training time.
The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances.
A thorough analysis of results from baselines and different participants is presented.
arXiv Detail & Related papers (2022-09-14T15:25:54Z) - DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for
Text-to-Image Generation [71.87682778102236]
We propose a novel Dynamical Semantic Evolution GAN (DSE-GAN) to re-compose each stage's text features under a novel single adversarial multi-stage architecture.
DSE-GAN achieves 7.48% and 37.8% relative FID improvements on two widely used benchmarks.
arXiv Detail & Related papers (2022-09-03T06:13:26Z) - 1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene
Text Understanding: End-to-End Recognition of Out of Vocabulary Words [35.2137931915091]
We describe our solution to the Out of Vocabulary Scene Text Understanding (OOV-ST) Challenge.
Our oCLIP-based model achieves an h-mean of 28.59%, which ranks 1st in the end-to-end OOV word recognition track of the OOV Challenge.
arXiv Detail & Related papers (2022-09-01T04:53:13Z) - Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding [59.8167502322261]
We propose Word2Pix: a one-stage visual grounding network based on encoder-decoder transformer architecture.
The embedding of each word in the query sentence is treated alike, attending to visual pixels individually.
The proposed Word2Pix outperforms existing one-stage methods by a notable margin.
arXiv Detail & Related papers (2021-07-31T10:20:15Z) - The Zero Resource Speech Challenge 2020: Discovering discrete subword
and word units [40.41406551797358]
The Zero Resource Speech Challenge 2020 aims at learning speech representations from raw audio signals without any labels.
We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
arXiv Detail & Related papers (2020-10-12T18:56:48Z) - VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [20.456325305495966]
This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task.
The task evaluates systems' ability to translate the audio of English TED talks into German text.
Our system is an end-to-end model based on an adaptation of the Transformer for speech data.
arXiv Detail & Related papers (2020-06-04T15:47:47Z)