ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in
Video-Language Models
- URL: http://arxiv.org/abs/2311.07022v1
- Date: Mon, 13 Nov 2023 02:13:13 GMT
- Title: ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in
Video-Language Models
- Authors: Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can
Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt,
Aykut Erdem, Erkut Erdem
- Abstract summary: We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing.
ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding.
We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
- Score: 28.305932427801682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the ever-increasing popularity of pretrained Video-Language Models
(VidLMs), there is a pressing need to develop robust evaluation methodologies
that delve deeper into their visio-linguistic capabilities. To address this
challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic
benchmark that places the assessment of fine-grained capabilities of these
models on a firm footing. Task-based evaluations, while valuable, fail to
capture the complexities and specific temporal aspects of moving images that
VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers
a controlled evaluation suite that sheds light on the true potential of these
models, as well as their performance gaps compared to human-level
understanding. ViLMA also includes proficiency tests, which assess basic
capabilities deemed essential to solving the main counterfactual tests. We show
that current VidLMs' grounding abilities are no better than those of
vision-language models which use static images. This is especially striking
once the performance on proficiency tests is factored in. Our benchmark serves
as a catalyst for future research on VidLMs, helping to highlight areas that
still need to be explored.
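The abstract outlines ViLMA's two-stage protocol: proficiency tests for basic capabilities, plus counterfactual main tests, both evaluated zero-shot. Below is a minimal, hypothetical sketch of how such a protocol could be scored against a black-box video-text matcher; the item fields, the `toy_scorer` placeholder, and the combined-accuracy rule (a model counts as correct only if it passes both tests) are illustrative assumptions, not the authors' released code.

```python
# Sketch of a ViLMA-style zero-shot evaluation loop (illustrative only).
# Each item pairs a video with a correct caption and a counterfactual foil,
# plus an optional proficiency-test pair. The VidLM is treated as a black-box
# scorer; `toy_scorer` stands in for a real video-text matching model.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ViLMAItem:
    video_id: str
    caption: str                          # correct description of the video
    foil: str                             # counterfactual (minimally edited) caption
    prof_caption: Optional[str] = None    # proficiency-test caption
    prof_foil: Optional[str] = None       # proficiency-test foil

Scorer = Callable[[str, str], float]      # (video_id, text) -> matching score

def toy_scorer(video_id: str, text: str) -> float:
    # Placeholder: a real scorer would embed the video and the text with a
    # pretrained VidLM and return their similarity.
    return float(len(set(video_id) & set(text)))

def evaluate(items: list[ViLMAItem], score: Scorer) -> dict[str, float]:
    """Accuracy on the main counterfactual test, alone and combined with
    the proficiency test (both must be passed)."""
    main_correct, combined_correct = 0, 0
    for it in items:
        main_ok = score(it.video_id, it.caption) > score(it.video_id, it.foil)
        prof_ok = True
        if it.prof_caption is not None and it.prof_foil is not None:
            prof_ok = score(it.video_id, it.prof_caption) > score(it.video_id, it.prof_foil)
        main_correct += main_ok
        combined_correct += main_ok and prof_ok
    n = max(len(items), 1)
    return {"main_acc": main_correct / n, "combined_acc": combined_correct / n}

if __name__ == "__main__":
    items = [ViLMAItem("vid_001",
                       "A person opens the door before leaving the room.",
                       "A person opens the door after leaving the room.",
                       "Someone interacts with a door.",
                       "Someone interacts with a window.")]
    print(evaluate(items, toy_scorer))
```

In this sketch, reporting the combined accuracy alongside the main accuracy mirrors the abstract's point that results look different "once the performance on proficiency tests is factored in".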
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-demanded evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs.
Our findings reveal significant discrepancies in reasoning abilities across different models.
We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of instruction-tuned large vision-language models (IT-LVLMs) on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- VLM-Eval: A General Evaluation on Video Large Language Models [16.92780012093112]
We introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition.
We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs.
We also evaluate video LLMs beyond academic datasets; they show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs used for fine-tuning.
arXiv Detail & Related papers (2023-11-20T16:02:10Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multimodal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.