UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding
- URL: http://arxiv.org/abs/2307.00862v1
- Date: Mon, 3 Jul 2023 09:03:12 GMT
- Title: UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding
- Authors: Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang,
Shih-Fu Chang
- Abstract summary: We propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
- Score: 84.83494254263138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging
because they require a model to reason about the semantics of both the visual
world and natural language. Supervised methods for these vision-language tasks
have been well studied; however, solving them in a zero-shot setting is less
explored. Since Contrastive Language-Image Pre-training (CLIP) has shown
remarkable zero-shot performance on image-text matching, previous works have
exploited this strong zero-shot ability by converting vision-language tasks
into an image-text matching problem, and they mainly consider global-level
matching (e.g., the whole image or sentence). However, we find that
fine-grained visual and textual information, e.g., keywords in the sentence and
objects in the image, can be fairly informative for semantic understanding.
Inspired by this, we propose a unified framework that takes advantage of
fine-grained information for zero-shot vision-language learning, covering
multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our
framework outperforms previous zero-shot methods on VQA and achieves
substantial improvements on SNLI-VE and VCR. Furthermore, our ablation studies
confirm the effectiveness and generalizability of our proposed method. Code
will be available at https://github.com/ThreeSR/UniFine
Related papers
- Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization [77.36122979882649]
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP).
In this paper, we explore the idea that CV adopts discrete and terminological task definitions, which may be a key barrier to zero-shot task generalization.
Our hypothesis is that, because of these terminological definitions, deep models never truly understand previously seen tasks and therefore struggle to generalize to novel tasks.
arXiv Detail & Related papers (2024-12-24T16:08:25Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training.
It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework.
It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z) - What does CLIP know about a red circle? Visual prompt engineering for
VLMs [116.8806079598019]
We explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text.
We show the power of this simple approach by achieving state-of-the-art results in zero-shot referring expression comprehension and strong performance on keypoint localization tasks.
arXiv Detail & Related papers (2023-04-13T17:58:08Z) - Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining
on Visual Language Understanding [13.300199242824934]
We investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning.
We propose a suite of visual language understanding tasks for probing the visual reasoning abilities of text encoder models.
We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks.
arXiv Detail & Related papers (2023-03-21T17:30:40Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme (a sketch of this kind of contrastive objective appears after this list).
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)