Multimodal Semi-Supervised Learning for Text Recognition
- URL: http://arxiv.org/abs/2205.03873v1
- Date: Sun, 8 May 2022 13:55:30 GMT
- Title: Multimodal Semi-Supervised Learning for Text Recognition
- Authors: Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman
- Abstract summary: We present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase.
Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training.
In a novel setup, consistency is enforced on each modality separately.
- Score: 10.33262222726707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Until recently, the number of public real-world text images was insufficient
for training scene text recognizers. Therefore, most modern training methods
rely on synthetic data and operate in a fully supervised manner. Nevertheless,
the amount of public real-world text images has increased significantly lately,
including a great deal of unlabeled data. Leveraging these resources requires
semi-supervised approaches; however, the few existing methods do not account
for the vision-language multimodal structure and are therefore suboptimal for
state-of-the-art multimodal architectures. To bridge this gap, we present
semi-supervised learning for multimodal text recognizers (SemiMTR) that
leverages unlabeled data at each modality training phase. Notably, our method
refrains from extra training stages and maintains the current three-stage
multimodal training procedure. Our algorithm starts by pretraining the vision
model through a single-stage training that unifies self-supervised learning
with supervised training. More specifically, we extend an existing visual
representation learning algorithm and propose the first contrastive-based
method for scene text recognition. After pretraining the language model on a
text corpus, we fine-tune the entire network via a sequential, character-level,
consistency regularization between weakly and strongly augmented views of text
images. In a novel setup, consistency is enforced on each modality separately.
Extensive experiments validate that our method outperforms the current training
schemes and achieves state-of-the-art results on multiple scene text
recognition benchmarks.
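The single-stage vision pretraining described in the abstract combines a supervised recognition loss with a contrastive objective computed between two augmented views of each text image. The following is a minimal, hypothetical PyTorch sketch of such a combined objective; the encoder interface, the pooling, the SimCLR-style NT-Xent formulation, and the weight lambda_con are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: single-stage vision pretraining that mixes a supervised
# per-character recognition loss with a contrastive loss on two augmented views.
# Module names and lambda_con are assumptions, not the SemiMTR implementation.
import torch
import torch.nn.functional as F
from torch import nn


class VisionPretrainLoss(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int,
                 proj_dim: int = 128, temperature: float = 0.1,
                 lambda_con: float = 1.0):
        super().__init__()
        self.encoder = encoder                       # assumed: images -> (B, T, feat_dim)
        self.proj = nn.Linear(feat_dim, proj_dim)    # projection head for the contrastive term
        self.cls = nn.Linear(feat_dim, num_classes)  # per-frame character classifier
        self.temperature = temperature
        self.lambda_con = lambda_con

    def contrastive(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        """NT-Xent loss between pooled embeddings of the two views."""
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, proj_dim)
        sim = z @ z.t() / self.temperature                       # cosine similarities
        n = z1.size(0)
        mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float('-inf'))               # drop self-similarity
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)

    def forward(self, view1, view2, labels=None):
        # labels (if given): (B, T) character indices, padded with -100 on unlabeled/pad positions
        f1 = self.encoder(view1)                     # (B, T, feat_dim)
        f2 = self.encoder(view2)
        z1 = self.proj(f1.mean(dim=1))               # pooled embedding of view 1
        z2 = self.proj(f2.mean(dim=1))               # pooled embedding of view 2
        loss = self.lambda_con * self.contrastive(z1, z2)
        if labels is not None:                       # supervised term on labeled images only
            logits = self.cls(f1)                    # (B, T, num_classes)
            loss = loss + F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                                          ignore_index=-100)
        return loss
```

The fine-tuning stage then enforces sequential, character-level consistency between predictions on a weakly augmented view (serving as the target) and a strongly augmented view, applied to each modality's output separately. Below is a hedged sketch of one such consistency term, assuming the recognizer emits per-character logits of shape (batch, seq_len, num_classes); the augmentation pipelines, the confidence threshold, and the branch names in the usage comment are placeholders.

```python
# Hypothetical sketch of a character-level consistency term between weakly and
# strongly augmented views of the same unlabeled text image.
import torch
import torch.nn.functional as F


def char_consistency_loss(weak_logits: torch.Tensor,
                          strong_logits: torch.Tensor,
                          threshold: float = 0.0) -> torch.Tensor:
    """Cross-entropy between strong-view predictions and pseudo-labels taken
    from the weak view, computed independently at every character position.

    Both inputs have shape (batch, seq_len, num_classes).
    """
    with torch.no_grad():                              # weak view acts as the teacher
        weak_probs = weak_logits.softmax(dim=-1)
        conf, pseudo = weak_probs.max(dim=-1)          # (batch, seq_len)
        keep = conf.ge(threshold)                      # optional confidence filter
    loss = F.cross_entropy(strong_logits.flatten(0, 1),
                           pseudo.flatten(),
                           reduction='none')           # per-character loss
    loss = loss * keep.flatten().float()
    return loss.sum() / keep.float().sum().clamp(min=1.0)


# Usage sketch: apply the same term to each modality's output separately,
# e.g. the vision-branch and language/fusion-branch logits of a multimodal
# recognizer (the variable names below are placeholders):
# total = char_consistency_loss(vision_weak, vision_strong) \
#       + char_consistency_loss(fusion_weak, fusion_strong)
```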
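A plain weighted sum of the labeled and unlabeled terms above gives one possible overall fine-tuning objective; the weighting is again an assumption rather than the paper's reported setting.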
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training [0.0]
A multimodal pre-training generalization algorithm for self-supervised training is proposed.
We show that when the filtered information is used to fine-tune multimodal machine translation, translation on the global voice dataset improves by 0.5 BLEU over the baseline.
arXiv Detail & Related papers (2023-02-16T03:34:08Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.