VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations
- URL: http://arxiv.org/abs/2207.00221v2
- Date: Thu, 22 Jun 2023 16:55:44 GMT
- Title: VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations
- Authors: Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee,
Xiaopeng Lu, Jianwei Yin
- Abstract summary: Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework.
- Score: 28.322824790738768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pretraining (VLP) models have recently successfully
facilitated many cross-modal downstream tasks. Most existing works evaluated
their systems by comparing the fine-tuned downstream task performance. However,
average downstream task accuracy alone provides little information about the
pros and cons of each VLP method, let alone insight into how the community can
improve these systems in the future. Inspired by CheckList for testing natural
language processing, we introduce VL-CheckList, a novel framework for
understanding the capabilities of VLP models. The proposed method divides the
image-text matching ability of a VLP model into three categories: objects,
attributes, and relations, and uses a novel taxonomy to further break down
these three aspects.
these three aspects. We conduct comprehensive studies to analyze seven recently
popular VLP models via the proposed framework. Results confirm the
effectiveness of the proposed method by revealing fine-grained differences
among the compared models that were not visible from downstream task-only
evaluation. Further results suggest promising research directions for building
better VLP models. Our data and code are available at:
https://github.com/om-ai-lab/VL-CheckList.
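The core probing recipe can be sketched in a few lines. The following is a minimal, hedged illustration of checklist-style pairwise evaluation, not the authors' released code (see the GitHub link above): for each capability category, the model scores an image against its original caption and against a caption whose object, attribute, or relation has been perturbed, and the fraction of cases where the original caption wins is reported. CLIP is used here only as a stand-in image-text scorer, and load_triples is a hypothetical data loader.

```python
# Sketch of checklist-style pairwise probing (assumptions: CLIP as the scorer,
# a hypothetical load_triples() that yields (image_path, positive, negative)).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def caption_preference_accuracy(triples):
    """Fraction of triples where the image-text score of the original caption
    exceeds that of the perturbed (object/attribute/relation-swapped) caption."""
    correct, total = 0, 0
    for image_path, pos_caption, neg_caption in triples:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[pos_caption, neg_caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            # logits_per_image has shape (1, 2): similarity of the image
            # to the positive and to the perturbed caption.
            logits = model(**inputs).logits_per_image[0]
        correct += int(logits[0] > logits[1])
        total += 1
    return correct / max(total, 1)

# Hypothetical usage: report one accuracy per capability category.
# for category in ("object", "attribute", "relation"):
#     print(category, caption_preference_accuracy(load_triples(category)))
```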
Related papers
- Active Learning for Vision-Language Models [29.309503214127016]
We propose a novel active learning (AL) framework that enhances the zero-shot classification performance of vision-language models (VLMs).
Our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection.
Our experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets.
arXiv Detail & Related papers (2024-10-29T16:25:50Z) - Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates visual grounding as a fill-in-the-blank problem: given a position-guided text prompt, the model is encouraged to predict the objects in the given blocks or to regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods, with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z) - Probing Cross-modal Semantics Alignment Capability from the Textual
Perspective [52.52870614418373]
Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method based on image captioning to first empirically study the cross-modal semantics alignment of VLP models.
arXiv Detail & Related papers (2022-10-18T02:55:58Z) - Counterfactually Measuring and Eliminating Social Bias in
Vision-Language Pre-training Models [13.280828458515062]
We introduce a counterfactual-based bias measurement, CounterBias, to quantify the social bias in Vision-Language Pre-training models.
We also construct a novel VL-Bias dataset including 24K image-text pairs for measuring gender bias.
arXiv Detail & Related papers (2022-07-03T14:39:32Z) - VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models [21.549122658275383]
Recent advances in vision-language pre-training have demonstrated impressive performance in a range of vision-language tasks.
We introduce the Vision-Language Understanding Evaluation benchmark, a multi-task, multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off of vision-language models.
arXiv Detail & Related papers (2022-05-30T16:52:30Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with natural language explanations (NLEs).
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)