Related papers: VideoGLUE: Video General Understanding Evaluation of Foundation Models

URL: http://arxiv.org/abs/2307.03166v2
Date: Fri, 1 Dec 2023 19:42:57 GMT
Title: VideoGLUE: Video General Understanding Evaluation of Foundation Models
Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
Abstract summary: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol. We propose a VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
Score: 90.54934154766585
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released under: https://github.com/tensorflow/models/tree/master/official/projects/videoglue.

Related papers

FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making [32.050134958163184]
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. We propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations.
arXiv Detail & Related papers (2025-07-15T21:49:49Z)
Enabling Time-series Foundation Model for Building Energy Forecasting via Contrastive Curriculum Learning [12.19823790689484]
We study the adaptation of foundation models (FMs) to building energy forecasting tasks. We propose a new contrastive curriculum learning-based training method. Experiments show that our method can improve the zero/few-shot performance by 14.6% compared to the existing FMs.
arXiv Detail & Related papers (2024-12-23T05:07:06Z)
Specialized Foundation Models Struggle to Beat Supervised Baselines [60.23386520331143]
We look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow. We find that it is consistently possible to train simple supervised models that match or even outperform the latest foundation models.
arXiv Detail & Related papers (2024-11-05T04:10:59Z)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [28.883607056108605]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z)
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model [22.188795668927586]
Video Foundation Models (VFMs) have made significant progress recently. Existing benchmarks and evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. We build a comprehensive benchmark suite to address these issues, namely VideoEval.
arXiv Detail & Related papers (2024-07-09T01:49:08Z)
On the Evaluation of Speech Foundation Models for Spoken Language Understanding [87.52911510306011]
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
arXiv Detail & Related papers (2024-06-14T14:37:52Z)
FedPFT: Federated Proxy Fine-Tuning of Foundation Models [55.58899993272904]
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune an FM by allocating a sub-FM to clients in FL, leading to suboptimal performance due to insufficient tuning and inevitable error accumulations of gradients. We propose Federated Proxy Fine-Tuning (FedPFT), a novel method enhancing FM adaptation in downstream tasks through FL by two key modules.
arXiv Detail & Related papers (2024-04-17T16:30:06Z)
Development of a Reliable and Accessible Caregiving Language Model (CaLM) [1.1487735059279973]
This study aimed to develop a reliable Caregiving Language Model (CaLM) by using FMs and a caregiving knowledge base. We developed CaLM using the Retrieval Augmented Generation (RAG) framework combined with FM fine-tuning for improving the quality of FM answers. The study shows that a reliable and accessible CaLM can be developed by using small FMs with a knowledge base specific to the caregiving domain.
arXiv Detail & Related papers (2024-03-11T16:12:34Z)
Learn From Model Beyond Fine-Tuning: A Survey [78.80920533793595]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FM) based on the model interface. The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta learning and model editing. This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
Leaf-FM: A Learnable Feature Generation Factorization Machine for Click-Through Rate Prediction [2.412497918389292]
We propose the Leaf-FM model, based on FM, to generate new features from the original feature embedding by learning the transformation functions automatically. Experiments are conducted on three real-world datasets, and the results show that the Leaf-FM model outperforms standard FMs by a large margin.
arXiv Detail & Related papers (2021-07-26T08:29:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.