VideoGLUE: Video General Understanding Evaluation of Foundation Models
- URL: http://arxiv.org/abs/2307.03166v2
- Date: Fri, 1 Dec 2023 19:42:57 GMT
- Title: VideoGLUE: Video General Understanding Evaluation of Foundation Models
- Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin
Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail
Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang,
Ting Liu, Boqing Gong
- Abstract summary: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol.
We propose a VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
- Score: 90.54934154766585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluate existing foundation models' video understanding capabilities using
a carefully designed experiment protocol consisting of three hallmark tasks
(action recognition, temporal localization, and spatiotemporal localization),
eight datasets well received by the community, and four adaptation methods
tailoring a foundation model (FM) for a downstream task. Moreover, we propose a
scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when
adapting to general video understanding tasks. Our main findings are as
follows. First, task-specialized models significantly outperform the six FMs
studied in this work, in sharp contrast to what FMs have achieved in natural
language and image understanding. Second, video-native FMs, whose pretraining
data contains the video modality, are generally better than image-native FMs in
classifying motion-rich videos, localizing actions in time, and understanding a
video of more than one action. Third, the video-native FMs can perform well on
video tasks under light adaptations to downstream tasks (e.g., freezing the FM
backbones), while image-native FMs win in full end-to-end finetuning. The first
two observations reveal the need and tremendous opportunities to conduct
research on video-focused FMs, and the last confirms that both tasks and
adaptation methods matter when it comes to the evaluation of FMs. Our code is
released under:
https://github.com/tensorflow/models/tree/master/official/projects/videoglue.
Related papers
- VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model [22.188795668927586]
Video Foundation Models (VFMs) have made significant progress recently.
Existing benchmarks and evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics.
We build a comprehensive benchmark suite to address these issues, namely VideoEval.
arXiv Detail & Related papers (2024-07-09T01:49:08Z)
- On the Evaluation of Speech Foundation Models for Spoken Language Understanding [87.52911510306011]
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking.
The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks.
We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
arXiv Detail & Related papers (2024-06-14T14:37:52Z)
- Foundation Models for Video Understanding: A Survey [26.52064059342181]
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks.
This survey analyzes over 200 video foundational models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks.
arXiv Detail & Related papers (2024-05-06T18:09:48Z)
- FedPFT: Federated Proxy Fine-Tuning of Foundation Models [55.58899993272904]
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs.
Existing methods fine-tune the FM by allocating a sub-FM to clients in FL, leading to suboptimal performance due to insufficient tuning and inevitable error accumulation in the gradients.
We propose Federated Proxy Fine-Tuning (FedPFT), a novel method that enhances FM adaptation to downstream tasks through FL via two key modules.
arXiv Detail & Related papers (2024-04-17T16:30:06Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- Learn From Model Beyond Fine-Tuning: A Survey [78.80920533793595]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FM) based on the model interface.
The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta learning and model editing.
This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Leaf-FM: A Learnable Feature Generation Factorization Machine for Click-Through Rate Prediction [2.412497918389292]
We propose the Leaf-FM model, based on FM, to generate new features from the original feature embedding by learning the transformation functions automatically.
Experiments are conducted on three real-world datasets, and the results show that the Leaf-FM model outperforms standard FMs by a large margin.
arXiv Detail & Related papers (2021-07-26T08:29:18Z)