1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene
Text Understanding: End-to-End Recognition of Out of Vocabulary Words
- URL: http://arxiv.org/abs/2209.00224v1
- Date: Thu, 1 Sep 2022 04:53:13 GMT
- Authors: Zhangzi Zhu, Chuhui Xue, Yu Hao, Wenqing Zhang, Song Bai
- Abstract summary: We describe our solution to the Out of Vocabulary Scene Text Understanding (OOV-ST) Challenge.
Our oCLIP-based model achieves 28.59% in h-mean, ranking 1st in the end-to-end OOV word recognition track of the OOV Challenge.
- Score: 35.2137931915091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition has attracted increasing interest in recent years due
to its wide range of applications in multilingual translation, autonomous
driving, etc. In this report, we describe our solution to the Out of Vocabulary
Scene Text Understanding (OOV-ST) Challenge, which aims to extract
out-of-vocabulary (OOV) words from natural scene images. Our oCLIP-based model
achieves 28.59% in h-mean, which ranks 1st in the end-to-end OOV word recognition
track of the OOV Challenge at the ECCV 2022 TiE Workshop.
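The h-mean reported above is, as in standard text-spotting evaluation, the harmonic mean of precision and recall over correctly detected and recognized words. A minimal sketch of that computation (the specific precision/recall values below are illustrative only, not figures from the paper):

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1-style score
    used to rank end-to-end text spotting results)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: a method with 30% precision and 27% recall
# on OOV words would score an h-mean of about 0.2842.
print(round(h_mean(0.30, 0.27), 4))
```

The harmonic mean penalizes imbalance: a method cannot compensate for very low recall with high precision, which matters for OOV words that models tend to miss entirely.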
Related papers
- V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results [142.5704093410454]
The V3Det Challenge 2024 aims to push the boundaries of object detection research.
The challenge consists of two tracks: Vast Vocabulary Object Detection and Open Vocabulary Object Detection.
We aim to inspire future research directions in vast vocabulary and open-vocabulary object detection.
arXiv Detail & Related papers (2024-06-17T16:58:51Z)
- VK-G2T: Vision and Context Knowledge enhanced Gloss2Text [60.57628465740138]
Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e., Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e., Gloss2Text).
We propose a vision and context knowledge enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of the sign language video to learn the properties of the target sentence and exploit the context knowledge to facilitate the adaptive translation of gloss words.
arXiv Detail & Related papers (2023-12-15T21:09:34Z) - UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z) - OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal
Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z) - Out-of-Vocabulary Challenge Report [15.827931962904115]
The Out-Of-Vocabulary 2022 (OOV) challenge introduces the recognition of unseen scene text instances at training time.
The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances.
A thorough analysis of results from baselines and different participants is presented.
arXiv Detail & Related papers (2022-09-14T15:25:54Z) - Vision-Language Adaptive Mutual Decoder for OOV-STR [39.35424739459689]
We design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly tackle out-of-vocabulary (OOV) problems.
Our approach achieved 70.31% and 59.61% word accuracy on the IV+OOV and OOV settings, respectively, of the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE Workshop.
arXiv Detail & Related papers (2022-09-02T07:32:22Z) - 1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene
Text Understanding: Cropped Word Recognition [35.2137931915091]
This report presents our winning solution to the ECCV 2022 challenge on Out-of-Vocabulary Scene Text Understanding (OOV-ST).
Our solution achieves an overall word accuracy of 69.73% when considering both in-vocabulary and out-of-vocabulary words.
arXiv Detail & Related papers (2022-08-04T16:20:58Z) - The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) [186.7816349401443]
We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval.
arXiv Detail & Related papers (2020-08-03T09:55:26Z) - On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images with words within the vocabulary but generalize poorly to images with words outside it.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.