VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
- URL: http://arxiv.org/abs/2407.06491v1
- Date: Tue, 9 Jul 2024 01:49:08 GMT
- Title: VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
- Authors: Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, Limin Wang
- Abstract summary: Video Foundation Models (VFMs) have made significant progress recently.
Existing benchmarks and evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics.
We build a comprehensive benchmark suite to address these issues, namely VideoEval.
- Score: 22.188795668927586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growth of high-quality data and advancement in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating their remarkable performance on traditional video understanding benchmarks. However, the existing benchmarks (e.g. Kinetics) and their evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. In this paper, we build a comprehensive benchmark suite to address these issues, namely VideoEval. Specifically, we establish the Video Task Adaption Benchmark (VidTAB) and the Video Embedding Benchmark (VidEB) from two perspectives: evaluating the task adaptability of VFMs under few-shot conditions and assessing their representation power by directly applying to downstream tasks. With VideoEval, we conduct a large-scale study on 20 popular open-source vision foundation models. Our study reveals some insightful findings on VFMs: 1) overall, current VFMs exhibit weak generalization across diverse tasks, 2) increasing video data, whether labeled or weakly-labeled video-text pairs, does not necessarily improve task performance, 3) the effectiveness of some pre-training paradigms may not be fully validated in previous benchmarks, and 4) combining different pre-training paradigms can help improve the generalization capabilities. We believe this study serves as an important complement to the current evaluation for VFMs and offers valuable insights for the future research.
Related papers
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- Foundation Models for Video Understanding: A Survey [26.52064059342181]
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks.
This survey analyzes over 200 video foundational models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks.
arXiv Detail & Related papers (2024-05-06T18:09:48Z)
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models [28.305932427801682]
We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing.
ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding.
We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
arXiv Detail & Related papers (2023-11-13T02:13:13Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- VideoGLUE: Video General Understanding Evaluation of Foundation Models [90.54934154766585]
We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment.
We propose a VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
arXiv Detail & Related papers (2023-07-06T17:47:52Z)
- Group-aware Contrastive Regression for Action Quality Assessment [85.43203180953076]
We show that the relations among videos can provide important clues for more accurate action quality assessment.
Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
arXiv Detail & Related papers (2021-08-17T17:59:39Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Unsupervised Video Summarization via Multi-source Features [4.387757291346397]
Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video.
We propose the incorporation of multiple feature sources with chunk and stride fusion to provide more information about the visual content.
For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches.
arXiv Detail & Related papers (2021-05-26T13:12:46Z)
- Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework [15.656965429236235]
We take steps towards making automatic video summarization more realistic by addressing several challenges.
Firstly, the currently available datasets either have very short videos or have few long videos of only a particular type.
We introduce a new benchmarking dataset, VISIOCITY, which comprises longer videos across six different categories.
arXiv Detail & Related papers (2020-07-29T02:44:35Z)
- UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content [59.13821614689478]
Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of content are unpredictable, complicated, and often commingled.
Here we contribute to advancing the problem by conducting a comprehensive evaluation of leading VQA models.
By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models.
Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models.
arXiv Detail & Related papers (2020-05-29T00:39:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.