RLIPv2: Fast Scaling of Relational Language-Image Pre-training
- URL: http://arxiv.org/abs/2308.09351v1
- Date: Fri, 18 Aug 2023 07:17:09 GMT
- Title: RLIPv2: Fast Scaling of Relational Language-Image Pre-training
- Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan,
Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao
- Abstract summary: We propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings.
- Score: 53.21796397618875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with
just 1% data and yields 45.09mAP with 100% data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2.
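The abstract describes ALIF only at a high level; as a rough illustration of the kind of mechanism it points to, the PyTorch sketch below shows one generic way to realise gated cross-modal fusion, where vision-side tokens attend to language features and a zero-initialised gate lets fusion be introduced early without destabilising training. Module names, sizes and the tanh gate are assumptions for exposition, not RLIPv2's implementation (the actual code is in the linked repository).

```python
# A minimal sketch of gated cross-modal fusion in the spirit of ALIF,
# assuming a PyTorch setting. Names, sizes and the tanh-gated residual
# are illustrative assumptions, not RLIPv2's actual implementation.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialised gate: the layer starts as an identity mapping,
        # so fusion can be switched on early without destabilising training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, vision_tokens, language_tokens, language_padding_mask=None):
        # vision_tokens: (B, Nv, dim); language_tokens: (B, Nl, dim)
        fused, _ = self.cross_attn(
            query=vision_tokens,
            key=language_tokens,
            value=language_tokens,
            key_padding_mask=language_padding_mask,
        )
        # Gated residual injection of language context into the vision stream.
        return self.norm(vision_tokens + torch.tanh(self.gate) * fused)


if __name__ == "__main__":
    layer = GatedCrossModalFusion()
    v = torch.randn(2, 100, 256)   # e.g. detection queries / region tokens
    t = torch.randn(2, 32, 256)    # e.g. sparsified language features
    print(layer(v, t).shape)       # torch.Size([2, 100, 256])
```

Per the abstract, the actual ALIF design pairs this early, gated fusion with sparsified language encoding layers so that cross-modal fusion can happen earlier and deeper at manageable cost; the sketch above captures only the gating idea.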
Related papers
- GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs [27.169892145194638]
GraphCLIP is a framework to learn graph foundation models with strong cross-domain zero/few-shot transferability.
We generate and curate large-scale graph-summary pair data with the assistance of LLMs.
For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective.
arXiv Detail & Related papers (2024-10-14T09:40:52Z)
- A Condensed Transition Graph Framework for Zero-shot Link Prediction with Large Language Models [20.220781775335645]
We introduce a Condensed Transition Graph Framework for Zero-Shot Link Prediction (CTLP)
CTLP encodes all the paths' information in linear time complexity to predict unseen relations between entities.
Our proposed CTLP method achieves state-of-the-art performance on three standard ZSLP datasets.
arXiv Detail & Related papers (2024-02-16T16:02:33Z)
- Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the full fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
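As a side note on the word-region alignment mentioned in the DetCLIPv2 entry above: a generic (not DetCLIPv2-specific) way to write such an objective is a contrastive alignment between region features and word embeddings, as in the hedged sketch below; the function name, loss form and temperature are assumptions for exposition.

```python
# A generic word-region alignment sketch (illustrative, not DetCLIPv2's code):
# score every (region, word) pair and apply a cross-entropy alignment loss
# against the ground-truth region-to-word assignment.
import torch
import torch.nn.functional as F

def word_region_alignment_loss(region_feats, word_feats, targets, temperature=0.07):
    """
    region_feats: (R, D) region embeddings from the detector
    word_feats:   (W, D) word/phrase embeddings from the text encoder
    targets:      (R,)   index of the matching word for each region
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    logits = region_feats @ word_feats.t() / temperature  # (R, W) similarity
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss = word_region_alignment_loss(
        torch.randn(5, 256), torch.randn(10, 256), torch.randint(0, 10, (5,))
    )
    print(loss.item())
```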
- RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection [32.20132357830726]
Relational Language-Image Pre-training (RLIP) is a strategy for contrastive pre-training that leverages both entity and relation descriptions.
We show the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection as well as increased robustness from noisy annotations.
arXiv Detail & Related papers (2022-09-05T07:50:54Z)
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
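The DaVinci entry above mentions a prefix language modeling objective over text and image sequences. The sketch below illustrates only the generic loss-masking idea behind prefix LM (condition on the prefix, predict the suffix); it is a textbook-style illustration with an assumed token layout, not the paper's implementation.

```python
# Generic prefix language modeling loss: the prefix (e.g. image tokens plus a
# text prefix) conditions the model but is not predicted; cross-entropy is
# computed only on the suffix tokens. Illustrative, not DaVinci's actual code.
import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, tokens, prefix_len):
    """
    logits: (B, T, V) next-token predictions from a decoder
    tokens: (B, T)    full sequence of token ids (prefix + suffix)
    prefix_len: number of leading positions excluded from the loss
    """
    # Shift so that position t predicts token t+1.
    pred = logits[:, :-1, :]
    target = tokens[:, 1:].clone()
    # Mask out predictions that still fall inside the prefix.
    target[:, : prefix_len - 1] = -100  # ignored by cross_entropy
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1), ignore_index=-100
    )


if __name__ == "__main__":
    B, T, V = 2, 16, 1000
    loss = prefix_lm_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)), prefix_len=6)
    print(loss.item())
```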
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to the latest single-stream methods while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)