IDEA: Increasing Text Diversity via Online Multi-Label Recognition for
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2207.05333v1
- Date: Tue, 12 Jul 2022 06:14:27 GMT
- Title: IDEA: Increasing Text Diversity via Online Multi-Label Recognition for
Vision-Language Pre-training
- Authors: Xinyu Huang, Youcai Zhang, Ying Cheng, Weiwei Tian, Ruiwei Zhao, Rui
Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Xiaobo Zhang
- Abstract summary: IDEA stands for increasing text diversity via online multi-label recognition for Vision-Language Pre-training.
We show that IDEA can significantly boost the performance on multiple downstream datasets with a small extra computational cost.
- Score: 18.898969509263804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pre-training (VLP) with large-scale image-text pairs has
demonstrated superior performance in various fields. However, image-text
pairs co-occurring on the Internet typically lack explicit alignment
information, which is suboptimal for VLP. Existing methods adopt an
off-the-shelf object detector to exploit additional image tag information.
However, the object detector is time-consuming and can only identify the
pre-defined object categories, limiting the model capacity. Inspired by the
observation that the texts incorporate incomplete fine-grained image
information, we introduce IDEA, which stands for increasing text diversity via
online multi-label recognition for VLP. IDEA shows that multi-label learning
with image tags extracted from the texts can be jointly optimized during VLP.
Moreover, IDEA can identify valuable image tags online to provide more explicit
textual supervision. Comprehensive experiments demonstrate that IDEA can
significantly boost the performance on multiple downstream datasets with a
small extra computational cost.
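The core mechanism in the abstract, mining image tags from the paired caption and jointly optimizing a multi-label objective, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tag vocabulary, the string-matching tag extractor, and the function names are all hypothetical assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical tag vocabulary; IDEA mines tags online from the caption text.
TAG_VOCAB = ["dog", "cat", "ball", "grass", "car"]

def tags_from_caption(caption):
    """Mark every vocabulary tag that appears in the caption (toy extractor)."""
    words = set(caption.lower().split())
    return torch.tensor([1.0 if t in words else 0.0 for t in TAG_VOCAB])

def multi_label_loss(image_logits, captions):
    """BCE multi-label loss over tags mined from the paired captions."""
    targets = torch.stack([tags_from_caption(c) for c in captions])
    return nn.functional.binary_cross_entropy_with_logits(image_logits, targets)

# Usage: logits would come from an image-encoder head over the tag vocabulary;
# zeros here stand in for an untrained model.
logits = torch.zeros(2, len(TAG_VOCAB))
loss = multi_label_loss(logits, ["a dog chases a ball", "a cat on grass"])
```

In the paper this auxiliary loss is optimized jointly with the usual VLP objectives, so the mined tags act as extra, more explicit textual supervision.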
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Context-Based Semantic-Aware Alignment for Semi-Supervised Multi-Label Learning [37.13424985128905]
Vision-language models pre-trained on large-scale image-text pairs can alleviate the challenge of limited labeled data under the semi-supervised multi-label learning (SSMLL) setting.
We propose a context-based semantic-aware alignment method to solve the SSMLL problem.
arXiv Detail & Related papers (2024-12-25T09:06:54Z)
- DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining.
It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images.
DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z)
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into one novel holistic paradigm.
We use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies.
Our holistic CLIP significantly outperforms the existing CLIP on tasks including image-text retrieval, open-vocabulary classification, and dense visual prediction.
arXiv Detail & Related papers (2024-11-30T11:27:58Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning [46.4308182215488]
A text-based image intuitively contains abundant and complex multimodal relational content.
We propose the Multimodal relAtional Graph adversarIal inferenCe framework for diverse and unpaired TextCap.
We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image.
arXiv Detail & Related papers (2021-12-13T11:00:49Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
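The dual-encoder contrastive objective described in the entry above (matched image/text pairs pulled together, all other pairs in the batch serving as negatives) can be sketched as a symmetric InfoNCE loss. This is an illustrative sketch, not the paper's code; the function name and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(img.size(0))     # i-th image matches i-th text
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

With a noisy billion-pair corpus, this simple objective is the entire learning scheme: scale compensates for noise in the alt-text supervision.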
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.