UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding
- URL: http://arxiv.org/abs/2307.00862v1
- Date: Mon, 3 Jul 2023 09:03:12 GMT
- Title: UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding
- Authors: Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang,
Shih-Fu Chang
- Abstract summary: We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
- Score: 84.83494254263138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because
they require the model's reasoning ability to understand the semantics of the
visual world and natural language. Supervised methods working for
vision-language tasks have been well-studied. However, solving these tasks in a
zero-shot setting is less explored. Since Contrastive Language-Image
Pre-training (CLIP) has shown remarkable zero-shot performance on image-text
matching, previous works utilized its strong zero-shot ability by converting
vision-language tasks into an image-text matching problem, and they mainly
consider global-level matching (e.g., the whole image or sentence). However, we
find visual and textual fine-grained information, e.g., keywords in the
sentence and objects in the image, can be fairly informative for semantics
understanding. Inspired by this, we propose a unified framework to take
advantage of the fine-grained information for zero-shot vision-language
learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our
experiments show that our framework outperforms former zero-shot methods on VQA
and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our
ablation studies confirm the effectiveness and generalizability of our proposed
method. Code will be available at https://github.com/ThreeSR/UniFine
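The recipe in the abstract, casting a task such as VQA as CLIP image-text matching and adding a fine-grained signal, can be sketched roughly as below. This is a minimal illustration, not the released UniFine code: the prompt template, the mixing weight alpha, and the shortcut of matching the answer text against the whole image (rather than keywords against detected object regions) are all assumptions.
```python
# Minimal sketch of zero-shot VQA as CLIP image-text matching (not the UniFine release).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image, texts):
    """Similarity logits between one image and a list of texts."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.squeeze(0)  # shape: (len(texts),)

def zero_shot_vqa(image, question, candidate_answers, alpha=0.5):
    # Global matching: fold each candidate answer into a full-sentence prompt.
    prompts = [f"question: {question} answer: {a}" for a in candidate_answers]
    global_scores = clip_scores(image, prompts)
    # Fine-grained matching (simplified): score each answer phrase on its own
    # against the image; the paper instead uses sentence keywords and detected objects.
    fine_scores = clip_scores(image, candidate_answers)
    total = global_scores + alpha * fine_scores
    return candidate_answers[int(total.argmax())]

image = Image.open("example.jpg")  # hypothetical image path
print(zero_shot_vqa(image, "What is the man holding?",
                    ["a surfboard", "a guitar", "a pizza"]))
```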
Related papers
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
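A hedged sketch of the retrieval setup described above: candidate captions are ranked by the log-likelihood a generative VLM assigns to them given the image, optionally length-normalized to reduce the bias toward short text. `vlm_token_logprobs` and `tokenize` are hypothetical stand-ins for whichever captioning model and tokenizer are used; they are not real library calls.
```python
# Sketch of likelihood-based zero-shot image-text retrieval with a generative VLM.
# `vlm_token_logprobs(image, tokens)` is a hypothetical helper returning the
# per-token log-probabilities log p(token_t | image, tokens_<t) of the chosen model.

def score_caption(image, tokens, vlm_token_logprobs, length_normalize=True):
    logps = vlm_token_logprobs(image, tokens)
    total = sum(logps)
    return total / len(logps) if length_normalize else total

def retrieve_caption(image, candidate_captions, tokenize, vlm_token_logprobs):
    scored = [(score_caption(image, tokenize(c), vlm_token_logprobs), c)
              for c in candidate_captions]
    return max(scored)[1]  # caption the VLM finds most likely given the image
```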
- What does CLIP know about a red circle? Visual prompt engineering for VLMs [116.8806079598019]
We explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text.
We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks.
arXiv Detail & Related papers (2023-04-13T17:58:08Z)
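A rough sketch of the visual-prompt idea above applied to zero-shot referring expression comprehension: each candidate box is marked with a red circle directly in pixel space, and CLIP picks the marked image that best matches the expression. The box format, line width, and the assumption that candidate boxes come from an external proposal model are illustrative choices, not the paper's exact setup.
```python
# Sketch of "red circle" visual prompting for referring expression comprehension.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def draw_red_circle(image, box, width=4):
    """Return a copy of the image with a red ellipse drawn around (x0, y0, x1, y1)."""
    marked = image.copy()
    ImageDraw.Draw(marked).ellipse(box, outline=(255, 0, 0), width=width)
    return marked

def refer(image, expression, candidate_boxes):
    marked_images = [draw_red_circle(image, b) for b in candidate_boxes]
    inputs = processor(text=[expression], images=marked_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_text.squeeze(0)  # one score per marked image
    return candidate_boxes[int(scores.argmax())]

image = Image.open("street.jpg")  # hypothetical path; boxes would come from a proposal model
print(refer(image, "the person in the red jacket",
            [(10, 20, 120, 260), (200, 40, 330, 280)]))
```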
- Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding [13.300199242824934]
We investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning.
We propose a suite of visual language understanding tasks for probing the visual reasoning abilities of text encoder models.
We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks.
arXiv Detail & Related papers (2023-03-21T17:30:40Z)
- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment [23.072180427273544]
We observe that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
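A simplified sketch of the two ideas summarized above, an image-text contrastive loss plus momentum distillation: an EMA copy of the encoders produces soft pseudo-targets that are mixed with the one-hot in-batch labels. The shapes, temperature, and mixing weight are assumptions for illustration, not the official ALBEF settings.
```python
# Sketch of an image-text contrastive loss with momentum-distilled soft targets.
import torch
import torch.nn.functional as F

def itc_loss_with_momentum(img_emb, txt_emb, img_emb_m, txt_emb_m,
                           temperature=0.07, alpha=0.4):
    """img_emb, txt_emb: (B, D) L2-normalized features from the online encoders.
    img_emb_m, txt_emb_m: features from the EMA (momentum) encoders."""
    sim_i2t = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    sim_t2i = txt_emb @ img_emb.t() / temperature
    with torch.no_grad():
        soft_i2t = F.softmax(img_emb_m @ txt_emb_m.t() / temperature, dim=1)
        soft_t2i = F.softmax(txt_emb_m @ img_emb_m.t() / temperature, dim=1)
        hard = torch.eye(img_emb.size(0), device=img_emb.device)
        tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard     # mixed pseudo-targets
        tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard
    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * tgt_i2t).sum(1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * tgt_t2i).sum(1).mean()
    return (loss_i2t + loss_t2i) / 2

def ema_update(model, momentum_model, m=0.995):
    """Update the momentum encoder as an exponential moving average of the online one."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)
```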
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
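A toy sketch of the proxy task above, image-conditioned masked language modeling: pooled visual features are prepended as an extra token and a small transformer predicts the masked caption words. All dimensions, the single-visual-token design, and the implied tokenizer are assumptions made for illustration, not the paper's architecture.
```python
# Toy image-conditioned masked language model: recover masked caption words from visual cues.
import torch
import torch.nn as nn

class ToyICMLM(nn.Module):
    def __init__(self, vocab_size, dim=256, img_dim=2048, nhead=4, nlayers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_dim, dim)        # project pooled CNN features to token space
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.mlm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids, image_feats):
        # token_ids: (B, T) with some positions replaced by a [MASK] id
        # image_feats: (B, img_dim) pooled visual features from a frozen backbone
        x = torch.cat([self.img_proj(image_feats).unsqueeze(1),
                       self.tok_emb(token_ids)], dim=1)   # prepend one visual token
        h = self.encoder(x)[:, 1:]                        # drop the visual position
        return self.mlm_head(h)                           # (B, T, vocab) logits

# Training would apply cross-entropy only at the masked positions, so the visual
# token must carry the information needed to recover words such as object names.
```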