VALSE: A Task-Independent Benchmark for Vision and Language Models
Centered on Linguistic Phenomena
- URL: http://arxiv.org/abs/2112.07566v1
- Date: Tue, 14 Dec 2021 17:15:04 GMT
- Title: VALSE: A Task-Independent Benchmark for Vision and Language Models
Centered on Linguistic Phenomena
- Authors: Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank,
Iacer Calixto, Albert Gatt
- Abstract summary: VALSE (Vision And Language Structured Evaluation) is a novel benchmark for testing general-purpose pretrained vision and language (V&L) models.
VALSE offers a suite of six tests covering various linguistic constructs.
We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models.
- Score: 15.984927623688915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose VALSE (Vision And Language Structured Evaluation), a novel
benchmark designed for testing general-purpose pretrained vision and language
(V&L) models for their visio-linguistic grounding capabilities on specific
linguistic phenomena. VALSE offers a suite of six tests covering various
linguistic constructs. Solving these requires models to ground linguistic
phenomena in the visual modality, allowing more fine-grained evaluations than
hitherto possible. We build VALSE using methods that support the construction
of valid foils, and report results from evaluating five widely-used V&L models.
Our experiments suggest that current models have considerable difficulty
addressing most phenomena. Hence, we expect VALSE to serve as an important
benchmark to measure future progress of pretrained V&L models from a linguistic
perspective, complementing the canonical task-centred V&L evaluations.
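For concreteness, here is a minimal sketch of how a foil-based benchmark of this kind can be scored: each item pairs an image with a correct caption and a minimally edited foil, and a model is credited when it assigns the caption a higher image-sentence alignment score than the foil. The `score_fn` below is a placeholder for any pretrained V&L model's image-text matching score; this is an illustrative sketch, not the authors' implementation.

```python
# Sketch of pairwise foil-based evaluation (assumed setup, not the VALSE release code).
# Assumption: score_fn(image_path, sentence) returns an image-sentence alignment score
# from some pretrained V&L model; here it is just a placeholder callable.

from typing import Callable, List, Tuple

def pairwise_accuracy(
    examples: List[Tuple[str, str, str]],          # (image_path, caption, foil) triples
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the correct caption outscores its foil."""
    correct = 0
    for image_path, caption, foil in examples:
        if score_fn(image_path, caption) > score_fn(image_path, foil):
            correct += 1
    return correct / max(len(examples), 1)

if __name__ == "__main__":
    # Toy scorer standing in for a real V&L model (for demonstration only).
    toy_score = lambda image, sentence: float(len(sentence))
    data = [("img_0001.jpg",
             "Two dogs are running on the beach.",
             "Two cats are running on the beach.")]
    print(f"pairwise accuracy: {pairwise_accuracy(data, toy_score):.2f}")
```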
Related papers
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z) - Uncertainty-Aware Evaluation for Vision-Language Models [0.0]
Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between a model's uncertainty and its language model component.
arXiv Detail & Related papers (2024-02-22T10:04:17Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Establishing Trustworthiness: Rethinking Tasks and Model Evaluation [36.329415036660535]
We argue that it is time to rethink what constitutes tasks and model evaluation in NLP.
We review existing compartmentalized approaches for understanding the origins of a model's functional capacity.
arXiv Detail & Related papers (2023-10-09T06:32:10Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - MetaVL: Transferring In-Context Learning Ability From Language Models to
Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding [1.827510863075184]
Curriculum is a new NLI benchmark format for evaluating broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
arXiv Detail & Related papers (2022-04-13T10:32:03Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - Effect of Vision-and-Language Extensions on Natural Language
Understanding in Vision-and-Language Models [24.5834345625595]
This paper investigates how visual extension affects the language capability of V&L models using the GLUE benchmark.
We find that visual extension causes some degradation of language capability, and that V&L pretraining contributes more to this degradation than structural modifications do.
arXiv Detail & Related papers (2021-04-16T12:28:50Z)