KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object
Knowledge Distillation
- URL: http://arxiv.org/abs/2109.10504v1
- Date: Wed, 22 Sep 2021 03:38:05 GMT
- Title: KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object
Knowledge Distillation
- Authors: Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan
Duan
- Abstract summary: Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
- Score: 42.01427946204401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised vision-and-language pretraining (VLP) aims to learn
transferable multi-modal representations from large-scale image-text data and
to achieve strong performances on a broad scope of vision-language tasks after
finetuning. Previous mainstream VLP approaches typically adopt a two-step
strategy that relies on external object detectors to encode images in a
multi-modal Transformer framework, which suffers from a restrictive object
concept space, limited image context, and inefficient computation. In this
paper, we propose an
object-aware end-to-end VLP framework, which directly feeds image grid features
from CNNs into the Transformer and learns the multi-modal representations
jointly. More importantly, we propose to perform object knowledge distillation
to facilitate learning cross-modal alignment at different semantic levels. To
achieve that, we design two novel pretext tasks by taking object features and
their semantic labels from external detectors as supervision: 1.) Object-guided
masked vision modeling task focuses on enforcing object-aware representation
learning in the multi-modal Transformer; 2.) Phrase-region alignment task aims
to improve cross-modal alignment by utilizing the similarities between noun
phrases and object labels in the linguistic space. Extensive experiments on a
wide range of vision-language tasks demonstrate the efficacy of our proposed
framework, and we achieve competitive or superior performances over the
existing pretraining strategies. The code is available in supplementary
materials.
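As a rough illustration of the two pretext objectives above, the sketch below shows one plausible PyTorch-style formulation; the module names, tensor shapes, region pooling, and exact loss forms are illustrative assumptions rather than the paper's released code. Object-guided masked vision modeling is sketched as regressing Transformer outputs at masked grid positions toward detector ROI features, and phrase-region alignment as distilling noun-phrase-to-object-label similarities from the linguistic space into the predicted cross-modal similarities.

```python
# Hypothetical sketch of the two object-knowledge-distillation pretext losses.
# Shapes, projections, and loss forms are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectKnowledgeLosses(nn.Module):
    def __init__(self, hidden_dim=768, detector_dim=2048, label_dim=300):
        super().__init__()
        # project Transformer outputs into the detector feature space
        self.to_detector = nn.Linear(hidden_dim, detector_dim)
        # project Transformer outputs into the label-embedding (linguistic) space
        self.to_label = nn.Linear(hidden_dim, label_dim)

    def masked_vision_loss(self, grid_out, obj_feats, obj_to_grid):
        """Object-guided masked vision modeling (sketch).
        grid_out:    (B, G, H) Transformer outputs at masked grid positions
        obj_feats:   (B, K, D) ROI features from an external detector
        obj_to_grid: (B, K, G) soft assignment of objects to grid cells
        """
        # pool Transformer grid outputs inside each object's region
        pooled = torch.bmm(obj_to_grid, grid_out)          # (B, K, H)
        pred = self.to_detector(pooled)                    # (B, K, D)
        # regress toward the detector's object features
        return F.smooth_l1_loss(pred, obj_feats)

    def phrase_region_loss(self, phrase_emb, region_out, label_emb):
        """Phrase-region alignment (sketch).
        phrase_emb: (B, P, E) noun-phrase embeddings in the linguistic space
        region_out: (B, K, H) Transformer outputs pooled over object regions
        label_emb:  (B, K, E) detector label embeddings (same space as phrases)
        """
        region_proj = self.to_label(region_out)            # (B, K, E)
        pred_sim = torch.bmm(
            F.normalize(phrase_emb, dim=-1),
            F.normalize(region_proj, dim=-1).transpose(1, 2))   # (B, P, K)
        target_sim = torch.bmm(
            F.normalize(phrase_emb, dim=-1),
            F.normalize(label_emb, dim=-1).transpose(1, 2))     # (B, P, K)
        # distill linguistic-space similarities into the cross-modal similarities
        return F.kl_div(F.log_softmax(pred_sim, dim=-1),
                        F.softmax(target_sim, dim=-1),
                        reduction='batchmean')
```

In a full pretraining setup these two losses would typically be weighted and added to the usual masked language modeling and image-text matching objectives.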
Related papers
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and aligns them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method that can be employed in the pre-training of various LVLMs without increasing computational cost at inference time (a generic sketch of this kind of contrastive feature alignment is given after the related-papers list).
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection prevents the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is a unifying multi-task framework that integrates pixel-level, region-level, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - Seeing What You Miss: Vision-Language Pre-training with Semantic
Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z) - Probing Inter-modality: Visual Parsing with Self-Attention for
Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z) - E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual
Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and the semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
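The DoCo entry above describes aligning document-object features from an auxiliary encoder with the features of an LVLM's vision encoder. As a rough, generic illustration of that kind of contrastive feature alignment (not the paper's actual implementation; the function name, shapes, and temperature are assumptions), a PyTorch-style sketch:

```python
# Hypothetical InfoNCE-style alignment between pooled vision-encoder features
# and document-object features from an auxiliary encoder (paired by index).
import torch
import torch.nn.functional as F

def document_object_contrastive_loss(vision_feats, doc_obj_feats, temperature=0.07):
    """vision_feats:  (B, D) one pooled vector per document image
       doc_obj_feats: (B, D) matching document-object features"""
    v = F.normalize(vision_feats, dim=-1)
    d = F.normalize(doc_obj_feats, dim=-1)
    logits = v @ d.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)    # positives on the diagonal
    # symmetric cross-entropy: pull matched pairs together, push mismatched apart
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```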
This list is automatically generated from the titles and abstracts of the papers in this site.