VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
- URL: http://arxiv.org/abs/2205.15237v1
- Date: Mon, 30 May 2022 16:52:30 GMT
- Title: VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
- Authors: Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang
- Abstract summary: Recent advances in vision-language pre-training have demonstrated impressive performance in a range of vision-language tasks.
We introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off of VLP models.
- Score: 21.549122658275383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language pre-training (VLP) have demonstrated
impressive performance in a range of vision-language (VL) tasks. However, there
exist several challenges for measuring the community's progress in building
general multi-modal intelligence. First, most of the downstream VL datasets are
annotated using raw images that are already seen during pre-training, which may
result in an overestimation of current VLP models' generalization ability.
Second, recent VLP work mainly focuses on absolute performance but overlooks
the efficiency-performance trade-off, which is also an important indicator for
measuring progress.
To this end, we introduce the Vision-Language Understanding Evaluation (VLUE)
benchmark, a multi-task multi-dimension benchmark for evaluating the
generalization capabilities and the efficiency-performance trade-off ("Pareto
SOTA") of VLP models. We demonstrate that there is a sizable generalization
gap for all VLP models when testing on out-of-distribution test sets annotated
on images from a more diverse distribution that spreads across cultures.
Moreover, we find that measuring the efficiency-performance trade-off of VLP
models leads to complementary insights for several design choices of VLP. We
release the VLUE benchmark to promote research on building vision-language
models that generalize well to more diverse images and concepts unseen during
pre-training, and are practical in terms of efficiency-performance trade-off.
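The abstract's notion of a "Pareto SOTA" can be illustrated with a minimal sketch. It assumes each model is summarized by one efficiency cost (lower is better) and one task score (higher is better). The `Entry` type, the `latency_ms` and `score` fields, the model names, and all numbers are hypothetical placeholders, not values or metrics taken from the VLUE paper. A model is on the Pareto front if no other model is at least as good on both axes and strictly better on one.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    latency_ms: float  # hypothetical efficiency cost, lower is better
    score: float       # hypothetical task score, higher is better


def pareto_front(entries: List[Entry]) -> List[Entry]:
    """Return the entries that no other entry dominates."""
    front = []
    for e in entries:
        dominated = any(
            o.latency_ms <= e.latency_ms
            and o.score >= e.score
            and (o.latency_ms < e.latency_ms or o.score > e.score)
            for o in entries
        )
        if not dominated:
            front.append(e)
    # Sort by cost so the front reads from cheapest to most expensive.
    return sorted(front, key=lambda e: e.latency_ms)


# Made-up placeholder numbers, for illustration only.
models = [
    Entry("model_a", 10.0, 70.0),
    Entry("model_b", 20.0, 75.0),
    Entry("model_c", 25.0, 72.0),  # slower and less accurate than model_b
    Entry("model_d", 40.0, 80.0),
]

front_names = [e.name for e in pareto_front(models)]
print(front_names)
```

Under this reading, a model with a lower absolute score can still count as "Pareto SOTA" if nothing cheaper matches its score, which is why the trade-off view can complement a ranking by absolute performance alone.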
Related papers
- Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models [42.79282247484499]
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue.
Recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge.
We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks.
arXiv Detail & Related papers (2026-02-19T22:07:29Z) - Vision-aligned Latent Reasoning for Multi-modal Large Language Model [82.26044667101011]
Vision-aligned Latent Reasoning (VaLR) is a framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step.
VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders.
arXiv Detail & Related papers (2026-02-04T12:04:02Z) - Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm.
Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content.
We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z) - VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models [43.09726338623949]
Vision-Language-Action (VLA) models integrate pretrained large Vision-Language Models (VLMs) into their policy backbone.
This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policy performance.
We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters.
arXiv Detail & Related papers (2026-01-06T09:58:24Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability.
Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks.
Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs [4.296395082987112]
Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks.
Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts.
We introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs.
arXiv Detail & Related papers (2025-09-20T11:12:23Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs).
We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens.
Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [95.63899307791665]
We present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension.
arXiv Detail & Related papers (2024-12-04T20:35:07Z) - VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems.
Current assessment methods rely on AI-annotated preference labels from traditional tasks.
We introduce VL-RewardBench, a benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z) - Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs have been shown to approach GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by the CheckList for testing natural language processing, we exploit VL-CheckList, a novel framework.
arXiv Detail & Related papers (2022-07-01T06:25:53Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)