Scalable Performance Analysis for Vision-Language Models
- URL: http://arxiv.org/abs/2305.18786v2
- Date: Wed, 31 May 2023 17:55:44 GMT
- Title: Scalable Performance Analysis for Vision-Language Models
- Authors: Santiago Castro and Oana Ignat and Rada Mihalcea
- Abstract summary: Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs.
- Score: 26.45624201546282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Joint vision-language models have shown great performance over a diverse set
of tasks. However, little is known about their limitations, as the high
dimensional space learned by these models makes it difficult to identify
semantic errors. Recent work has addressed this problem by designing highly
controlled probing task benchmarks. Our paper introduces a more scalable
solution that relies on already annotated benchmarks. Our method consists of
extracting a large set of diverse features from a vision-language benchmark and
measuring their correlation with the output of the target model. We confirm
previous findings that CLIP behaves like a bag of words model and performs
better with nouns and verbs; we also uncover novel insights such as CLIP
getting confused by concrete words. Our framework is available at
https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other
multimodal models and benchmarks.
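As a rough sketch of the feature-correlation recipe described in the abstract (not the authors' exact pipeline, which lives in the linked repository), the snippet below extracts one toy caption feature, a noun count, and correlates it with placeholder per-example model scores. The feature choice, the captions, and the scores are illustrative assumptions.

```python
# Minimal sketch, assuming per-example model scores are already available:
# extract a simple linguistic feature from benchmark captions and measure its
# correlation with the target model's output. The feature (noun count), the
# captions, and the scores below are placeholders, not the authors' pipeline.
from scipy.stats import pearsonr, spearmanr
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline for POS tagging

def noun_count(caption: str) -> int:
    # One example of a "diverse feature": how many noun tokens the caption has.
    return sum(token.pos_ == "NOUN" for token in nlp(caption))

# Hypothetical benchmark slice: captions paired with the target model's scores
# (e.g., CLIP image-text similarities computed separately).
captions = [
    "A dog runs across the grassy park",
    "Two people sitting at a wooden table",
    "A red bus parked near the station",
    "Someone is smiling",
]
model_scores = [0.31, 0.27, 0.29, 0.12]  # placeholder outputs, one per caption

feature_values = [noun_count(c) for c in captions]
r, p = pearsonr(feature_values, model_scores)
rho, p_rank = spearmanr(feature_values, model_scores)
print(f"Pearson r={r:.2f} (p={p:.2f}); Spearman rho={rho:.2f} (p={p_rank:.2f})")
```

In the full framework, many such features (parts of speech, concreteness, caption length, and so on) would be scored jointly over an entire benchmark; the two-list correlation above is only meant to show the shape of the analysis.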
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z)
- Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models [10.482557806309174]
We introduce LexBench, a comprehensive evaluation suite designed to test language models (LMs) on semantic phrase processing tasks.
Using this benchmark, we assess the performance of 15 LMs across model architectures and parameter scales on classification, extraction, and interpretation tasks.
Our benchmarking findings can serve future research aiming to improve the generic capability of LMs on semantic phrase comprehension.
arXiv Detail & Related papers (2024-05-05T09:20:38Z)
- CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models [33.80107512462935]
Foundational Vision-Language Models (VLMs) excel at object-centric recognition yet learn text representations that seem invariant to word order.
No evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully.
In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language.
arXiv Detail & Related papers (2024-02-22T23:42:25Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding [13.352795145385645]
Large pretrained language models have demonstrated strong language understanding capabilities.
We evaluate several such models, including ChatGPT and OPT models of different sizes, on multiple benchmarks.
We show, however, that ChatGPT is worse at slot filling, and its performance is sensitive to ASR errors.
arXiv Detail & Related papers (2023-05-22T21:59:26Z)
- CLUES: Few-Shot Learning Evaluation in Natural Language Understanding [81.63968985419982]
We introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models.
We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
arXiv Detail & Related papers (2021-11-04T00:43:15Z)