Vision-Language Adaptive Mutual Decoder for OOV-STR
- URL: http://arxiv.org/abs/2209.00859v2
- Date: Mon, 30 Oct 2023 03:15:16 GMT
- Title: Vision-Language Adaptive Mutual Decoder for OOV-STR
- Authors: Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Jiajia Wu, Jun Du,
Lirong Dai
- Abstract summary: We design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly tackle out-of-vocabulary (OOV) problems.
Our approach achieved 70.31% and 59.61% word accuracy on the IV+OOV and OOV settings respectively on the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE Workshop.
- Score: 39.35424739459689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown the huge success of deep learning models for
common in-vocabulary (IV) scene text recognition. However, in real-world
scenarios, out-of-vocabulary (OOV) words are of great importance, and SOTA
recognition models usually perform poorly in OOV settings. Inspired by the
intuition that the learned language prior limits OOV performance, we design a
framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly
tackle OOV problems. VLAMD consists of three main components. First, we build
an attention-based LSTM decoder with two adaptively merged visual-only modules,
yielding a vision-language balanced main branch. Second, we add an auxiliary
query-based autoregressive transformer decoding head for common visual and
language prior representation learning. Finally, we couple these two designs
with bidirectional training for more diverse language modeling, and perform
mutual sequential decoding to obtain more robust results. Our approach achieved
70.31% and 59.61% word accuracy on the IV+OOV and OOV settings respectively on
the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE
Workshop, where we took 1st place in both settings.
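The abstract does not spell out the architecture in full, but the two-branch design it describes (an attention-based LSTM decoder adaptively merged with visual-only context, plus an auxiliary autoregressive transformer decoding head, combined at decoding time) can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the class name `TwoBranchDecoder`, the sigmoid gating rule, the layer sizes, and the simple log-probability averaging standing in for "mutual decoding" are all illustrative and not the authors' implementation.

```python
# Minimal sketch of a VLAMD-style two-branch decoder (illustrative only; the
# paper's exact fusion, sizes, and decoding procedure are not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Branch 1: attention-based LSTM decoder with an adaptive visual gate.
        self.lstm = nn.LSTMCell(d_model * 2, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.gate = nn.Linear(d_model * 2, 1)   # mixes language state and visual-only context
        self.cls_lstm = nn.Linear(d_model, vocab_size)
        # Branch 2: auxiliary query-based autoregressive transformer decoding head.
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_tf = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, prev_tokens):
        """visual_feats: (B, T, d_model); prev_tokens: (B, L) teacher-forced inputs."""
        B, L = prev_tokens.shape
        emb = self.embed(prev_tokens)

        # --- Branch 1: LSTM with adaptively merged visual-only context ---
        h = visual_feats.new_zeros(B, emb.size(-1))
        c = torch.zeros_like(h)
        logits_lstm = []
        for t in range(L):
            # Visual-only context: attention driven by the hidden state, not the token.
            ctx, _ = self.attn(h.unsqueeze(1), visual_feats, visual_feats)
            ctx = ctx.squeeze(1)
            h, c = self.lstm(torch.cat([emb[:, t], ctx], dim=-1), (h, c))
            # Adaptive merge: gate between the language-aware state and the visual context.
            g = torch.sigmoid(self.gate(torch.cat([h, ctx], dim=-1)))
            fused = g * h + (1.0 - g) * ctx
            logits_lstm.append(self.cls_lstm(fused))
        logits_lstm = torch.stack(logits_lstm, dim=1)

        # --- Branch 2: autoregressive transformer decoding head ---
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(emb.device)
        dec = self.transformer(emb, visual_feats, tgt_mask=causal)
        logits_tf = self.cls_tf(dec)

        # Simplified "mutual decoding": average the two branches' log-probabilities.
        return 0.5 * (F.log_softmax(logits_lstm, -1) + F.log_softmax(logits_tf, -1))
```

At inference, greedy or beam search over the averaged log-probabilities would stand in for the paper's mutual sequential decoding; the bidirectional-training component of VLAMD is not shown here.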
Related papers
- VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer [22.06023928642522]
We propose an innovative scene text recognition approach, named VL-Reader.
The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process.
In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions.
arXiv Detail & Related papers (2024-09-18T02:46:28Z)
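The pre-training objective summarized for VL-Reader (reconstructing masked visual and text tokens) can be written, very roughly, as a loss over the masked positions only. The function below is a minimal sketch under stated assumptions; the name `masked_reconstruction_loss`, the tensor shapes, and the MSE term for visual tokens are illustrative and not VL-Reader's actual formulation.

```python
# Rough sketch of a masked-reconstruction objective (illustrative assumptions).
import torch.nn.functional as F


def masked_reconstruction_loss(pred_vis, tgt_vis, pred_txt_logits, tgt_txt,
                               vis_mask, txt_mask):
    """Loss computed only on masked positions, as in masked-token pre-training.

    pred_vis, tgt_vis:  (B, N, D) predicted / target visual tokens
    pred_txt_logits:    (B, L, V) character logits; tgt_txt: (B, L) target ids
    vis_mask, txt_mask: boolean tensors marking which positions were masked out
    """
    vis_loss = F.mse_loss(pred_vis[vis_mask], tgt_vis[vis_mask])
    txt_loss = F.cross_entropy(pred_txt_logits[txt_mask], tgt_txt[txt_mask])
    return vis_loss + txt_loss
```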
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
arXiv Detail & Related papers (2021-11-03T17:20:36Z)
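The Mixture-of-Modality-Experts (MoME) idea summarized above can be pictured as a Transformer block that shares self-attention across modalities while routing the feed-forward computation to a modality-specific expert. The block below is a minimal PyTorch sketch with assumed sizes and expert names ("vision", "language", "vision_language"); it is not VLMo's released implementation.

```python
# Minimal sketch of a MoME-style Transformer block (illustrative assumptions).
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    def __init__(self, d_model: int = 768, nhead: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality; attention parameters are shared.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        """x: (B, T, d_model); `modality` selects which FFN expert to apply."""
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.experts[modality](self.norm2(x))
```

Under this reading, fine-tuning the pretrained model as a dual encoder or as a fusion encoder amounts to choosing which expert (and which input stream) is active, which matches the flexibility described in the summary above.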