RLIPv2: Fast Scaling of Relational Language-Image Pre-training
- URL: http://arxiv.org/abs/2308.09351v1
- Date: Fri, 18 Aug 2023 07:17:09 GMT
- Title: RLIPv2: Fast Scaling of Relational Language-Image Pre-training
- Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan,
Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao
- Abstract summary: We propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings.
- Score: 53.21796397618875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with
just 1% data and yields 45.09mAP with 100% data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2.
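The abstract describes ALIF only at a high level; as a rough illustration of the kind of mechanism it points to, the PyTorch sketch below shows one generic way to realise gated cross-modal fusion, where vision-side tokens attend to language features and a zero-initialised gate lets fusion be introduced early without destabilising training. Module names, sizes and the tanh gate are assumptions for exposition, not RLIPv2's implementation (the actual code is in the linked repository).

```python
# A minimal sketch of gated cross-modal fusion in the spirit of ALIF,
# assuming a PyTorch setting. Names, sizes and the tanh-gated residual
# are illustrative assumptions, not RLIPv2's actual implementation.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialised gate: the layer starts as an identity mapping,
        # so fusion can be switched on early without destabilising training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, vision_tokens, language_tokens, language_padding_mask=None):
        # vision_tokens: (B, Nv, dim); language_tokens: (B, Nl, dim)
        fused, _ = self.cross_attn(
            query=vision_tokens,
            key=language_tokens,
            value=language_tokens,
            key_padding_mask=language_padding_mask,
        )
        # Gated residual injection of language context into the vision stream.
        return self.norm(vision_tokens + torch.tanh(self.gate) * fused)


if __name__ == "__main__":
    layer = GatedCrossModalFusion()
    v = torch.randn(2, 100, 256)   # e.g. detection queries / region tokens
    t = torch.randn(2, 32, 256)    # e.g. sparsified language features
    print(layer(v, t).shape)       # torch.Size([2, 100, 256])
```

Per the abstract, the actual ALIF design pairs this early, gated fusion with sparsified language encoding layers so that cross-modal fusion can happen earlier and deeper at manageable cost; the sketch above captures only the gating idea.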
Related papers
- GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs [27.169892145194638]
GraphCLIP is a framework to learn graph foundation models with strong cross-domain zero/few-shot transferability.
We generate and curate large-scale graph-summary pair data with the assistance of LLMs.
For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective.
arXiv Detail & Related papers (2024-10-14T09:40:52Z)
- A Condensed Transition Graph Framework for Zero-shot Link Prediction with Large Language Models [20.220781775335645]
We introduce a Condensed Transition Graph Framework for Zero-Shot Link Prediction (CTLP)
CTLP encodes all the paths' information in linear time complexity to predict unseen relations between entities.
Our proposed CTLP method achieves state-of-the-art performance on three standard ZSLP datasets.
arXiv Detail & Related papers (2024-02-16T16:02:33Z)
- Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the full fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
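As a side note on the word-region alignment mentioned in the DetCLIPv2 entry above: a generic (not DetCLIPv2-specific) way to write such an objective is a contrastive alignment between region features and word embeddings, as in the hedged sketch below; the function name, loss form and temperature are assumptions for exposition.

```python
# A generic word-region alignment sketch (illustrative, not DetCLIPv2's code):
# score every (region, word) pair and apply a cross-entropy alignment loss
# against the ground-truth region-to-word assignment.
import torch
import torch.nn.functional as F

def word_region_alignment_loss(region_feats, word_feats, targets, temperature=0.07):
    """
    region_feats: (R, D) region embeddings from the detector
    word_feats:   (W, D) word/phrase embeddings from the text encoder
    targets:      (R,)   index of the matching word for each region
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    logits = region_feats @ word_feats.t() / temperature  # (R, W) similarity
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss = word_region_alignment_loss(
        torch.randn(5, 256), torch.randn(10, 256), torch.randint(0, 10, (5,))
    )
    print(loss.item())
```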
- RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection [32.20132357830726]
Relational Language-Image Pre-training (RLIP) is a strategy for contrastive pre-training that leverages both entity and relation descriptions.
We show the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection as well as increased robustness from noisy annotations.
arXiv Detail & Related papers (2022-09-05T07:50:54Z)
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
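The DaVinci entry above mentions a prefix language modeling objective over text and image sequences. The sketch below illustrates only the generic loss-masking idea behind prefix LM (condition on the prefix, predict the suffix); it is a textbook-style illustration with an assumed token layout, not the paper's implementation.

```python
# Generic prefix language modeling loss: the prefix (e.g. image tokens plus a
# text prefix) conditions the model but is not predicted; cross-entropy is
# computed only on the suffix tokens. Illustrative, not DaVinci's actual code.
import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, tokens, prefix_len):
    """
    logits: (B, T, V) next-token predictions from a decoder
    tokens: (B, T)    full sequence of token ids (prefix + suffix)
    prefix_len: number of leading positions excluded from the loss
    """
    # Shift so that position t predicts token t+1.
    pred = logits[:, :-1, :]
    target = tokens[:, 1:].clone()
    # Mask out predictions that still fall inside the prefix.
    target[:, : prefix_len - 1] = -100  # ignored by cross_entropy
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1), ignore_index=-100
    )


if __name__ == "__main__":
    B, T, V = 2, 16, 1000
    loss = prefix_lm_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)), prefix_len=6)
    print(loss.item())
```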
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to the latest single-stream methods while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)