VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature
Alignment
- URL: http://arxiv.org/abs/2210.04135v3
- Date: Mon, 30 Oct 2023 01:56:14 GMT
- Title: VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature
Alignment
- Authors: Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann
LeCun and Rama Chellappa
- Abstract summary: VoLTA is a new vision-language pre-training paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
- Score: 52.489874804051304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language pre-training (VLP) has recently proven highly effective for
various uni- and multi-modal downstream applications. However, most existing
end-to-end VLP methods use high-resolution image-text box data to perform well
on fine-grained region-level tasks, such as object detection, segmentation, and
referring expression comprehension. Unfortunately, such high-resolution images
with accurate bounding box annotations are expensive to collect and use for
supervision at scale. In this work, we propose VoLTA (Vision-Language
Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm
that only utilizes image-caption data but achieves fine-grained region-level
image understanding, eliminating the use of expensive box annotations. VoLTA
adopts graph optimal transport-based weakly-supervised alignment on local image
patches and text tokens to germinate an explicit, self-normalized, and
interpretable low-level matching criterion. In addition, VoLTA pushes
multi-modal fusion deep into the uni-modal backbones during pre-training and
removes fusion-specific transformer layers, further reducing memory
requirements. Extensive experiments on a wide range of vision- and
vision-language downstream tasks demonstrate the effectiveness of VoLTA on
fine-grained applications without compromising the coarse-grained downstream
performance, often outperforming methods using significantly more caption and
box annotations.
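The graph optimal transport alignment described above can be pictured, at a rough level, as computing a soft transport plan between image patch embeddings and text token embeddings and penalizing the resulting transport cost. The sketch below shows only an entropic (Sinkhorn) Wasserstein-style alignment of this kind; the function name, uniform marginals, regularization value, and tensor shapes are illustrative assumptions and not taken from the paper, whose graph optimal transport formulation also involves a graph-structure (Gromov-Wasserstein style) term.

# Minimal sketch of OT-based patch-token alignment (illustrative; not the
# paper's exact formulation). Cosine cost between L2-normalized image patch
# and text token embeddings, a transport plan via Sinkhorn iterations, and
# the expected transport cost used as a weakly-supervised matching loss.
import torch
import torch.nn.functional as F

def sinkhorn_alignment_loss(patch_emb, token_emb, eps=0.1, n_iters=50):
    """patch_emb: (N, D) image patch features; token_emb: (M, D) text token features."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)
    cost = 1.0 - patch_emb @ token_emb.t()                # (N, M) cosine distance
    N, M = cost.shape
    mu = torch.full((N,), 1.0 / N, device=cost.device)    # uniform patch marginal
    nu = torch.full((M,), 1.0 / M, device=cost.device)    # uniform token marginal
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn-Knopp scaling
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)               # transport plan (entries sum to 1)
    return (T * cost).sum()                               # expected matching cost

# Example: align 196 patch embeddings with 20 token embeddings (random features).
loss = sinkhorn_alignment_loss(torch.randn(196, 256), torch.randn(20, 256))

Because the transport plan sums to one by construction, the resulting cost does not depend on the raw numbers of patches and tokens, which is roughly the self-normalization property the abstract emphasizes.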
Related papers
- LOBG: Less Overfitting for Better Generalization in Vision-Language Model [19.890629892640206]
We propose a framework named LOBG for vision-language models.
We use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts.
Our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-14T08:06:21Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z) - Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative
Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent images and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object
Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)