ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in
Video-Language Models
- URL: http://arxiv.org/abs/2311.07022v1
- Date: Mon, 13 Nov 2023 02:13:13 GMT
- Title: ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in
Video-Language Models
- Authors: Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can
Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt,
Aykut Erdem, Erkut Erdem
- Abstract summary: We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing.
ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding.
We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the ever-increasing popularity of pretrained Video-Language Models
(VidLMs), there is a pressing need to develop robust evaluation methodologies
that delve deeper into their visio-linguistic capabilities. To address this
challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic
benchmark that places the assessment of fine-grained capabilities of these
models on a firm footing. Task-based evaluations, while valuable, fail to
capture the complexities and specific temporal aspects of moving images that
VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers
a controlled evaluation suite that sheds light on the true potential of these
models, as well as their performance gaps compared to human-level
understanding. ViLMA also includes proficiency tests, which assess basic
capabilities deemed essential to solving the main counterfactual tests. We show
that current VidLMs' grounding abilities are no better than those of
vision-language models which use static images. This is especially striking
once the performance on proficiency tests is factored in. Our benchmark serves
as a catalyst for future research on VidLMs, helping to highlight areas that
still need to be explored.
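To make the evaluation protocol concrete: zero-shot benchmarks of this kind typically present a model with a video, its correct caption, and a minimally edited counterfactual (foil), and count the model correct when it assigns the true caption a higher video-text score. The sketch below illustrates that loop under stated assumptions; the VideoTextScorer interface, the Example fields, and the proficiency pass/fail flags are hypothetical placeholders, not ViLMA's released code or data format.

```python
# Minimal sketch of a zero-shot counterfactual (caption-vs-foil) evaluation,
# in the spirit of benchmarks like ViLMA. The scorer below is a hypothetical
# stand-in for any pretrained video-language model that returns a video-text
# similarity score; it is NOT the ViLMA reference implementation.

from dataclasses import dataclass
from typing import List, Optional, Protocol


class VideoTextScorer(Protocol):
    """Any pretrained VidLM wrapped to return a video-text similarity score."""

    def score(self, video_path: str, text: str) -> float:
        ...


@dataclass
class Example:
    video_path: str                    # path or URL of the video clip
    caption: str                       # caption that matches the video
    foil: str                          # counterfactual caption (minimal edit, mismatched)
    proficiency_q: Optional[str] = None  # optional associated proficiency-test query


def pairwise_accuracy(model: VideoTextScorer, examples: List[Example]) -> float:
    """Fraction of examples where the true caption outscores its foil."""
    correct = 0
    for ex in examples:
        s_caption = model.score(ex.video_path, ex.caption)
        s_foil = model.score(ex.video_path, ex.foil)
        correct += int(s_caption > s_foil)
    return correct / max(len(examples), 1)


def combined_accuracy(model: VideoTextScorer,
                      examples: List[Example],
                      proficiency_pass: List[bool]) -> float:
    """Stricter score: an example only counts as correct if the model also
    passed the associated proficiency test (hypothetical pass/fail flags)."""
    correct = 0
    for ex, passed in zip(examples, proficiency_pass):
        if passed and model.score(ex.video_path, ex.caption) > model.score(ex.video_path, ex.foil):
            correct += 1
    return correct / max(len(examples), 1)
```

Reporting accuracy both with and without the proficiency filter mirrors the abstract's observation that counterfactual accuracy alone can overstate grounding when the underlying proficiency test is failed.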
Related papers
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.
Through an extensive evaluation of nine popular LVLMs across five user-specified evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language
Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs.
Our findings reveal significant discrepancies in reasoning abilities across different models.
We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - VLM-Eval: A General Evaluation on Video Large Language Models [16.92780012093112]
We introduce a unified evaluation that encompasses multiple video tasks, including captioning, question answering, retrieval, and action recognition.
We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs.
We also evaluate video LLMs beyond academic datasets; they show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs used for fine-tuning.
arXiv Detail & Related papers (2023-11-20T16:02:10Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - Look, Remember and Reason: Grounded reasoning in videos with language
models [5.3445140425713245]
Multi-modal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)