VMDT: Decoding the Trustworthiness of Video Foundation Models
- URL: http://arxiv.org/abs/2511.05682v1
- Date: Fri, 07 Nov 2025 19:56:00 GMT
- Title: VMDT: Decoding the Trustworthiness of Video Foundation Models
- Authors: Yujin Potter, Zhun Wang, Nicholas Crispino, Kyle Montgomery, Alexander Xiong, Ethan Y. Chang, Francesco Pinto, Yuqi Chen, Rahul Gupta, Morteza Ziyadi, Christos Christodoulopoulos, Bo Li, Chenguang Wang, Dawn Song
- Abstract summary: We introduce VMDT, the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models. Through our evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights.
- Score: 77.90980744982079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve -- though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.
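As an illustration of the scale-versus-trustworthiness analysis behind the finding that safety does not track model size while other dimensions do, the sketch below computes a Spearman rank correlation between model scale and per-dimension scores. The model sizes and scores are fabricated placeholders, not numbers from VMDT.

```python
# Hypothetical illustration (not VMDT's released code): checking whether a
# trustworthiness dimension correlates with model scale. All numbers below
# are made up for demonstration purposes.
from scipy.stats import spearmanr

# Model scale in billions of parameters, paired with a score in [0, 1]
model_sizes = [0.5, 2.0, 7.0, 13.0, 34.0, 72.0]
safety_scores = [0.61, 0.55, 0.63, 0.58, 0.60, 0.57]        # fabricated
hallucination_scores = [0.42, 0.48, 0.55, 0.58, 0.66, 0.71]  # fabricated

for name, scores in [("safety", safety_scores),
                     ("hallucination", hallucination_scores)]:
    rho, p_value = spearmanr(model_sizes, scores)
    print(f"{name}: Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```

A near-zero rho for safety and a clearly positive rho for hallucination would reproduce, in miniature, the qualitative pattern the abstract reports for V2T models.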
Related papers
- VidLeaks: Membership Inference Attacks Against Text-to-Video Models [17.443499650679964]
Membership inference attacks (MIAs) provide a principled tool for auditing copyright and privacy violations. We introduce a novel framework, VidLeaks, which probes sparse-temporal memorization through two complementary signals. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse memorization and temporal memorization.
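For context, a membership inference attack ultimately reduces to thresholding one or more per-sample scores. The sketch below shows that generic decision rule with two placeholder signals; the scores, weighting, and threshold are assumptions for illustration, not the actual VidLeaks procedure.

```python
# Generic membership-inference decision rule, for illustration only.
from dataclasses import dataclass

@dataclass
class MIAScores:
    sparse_signal: float    # hypothetical: how strongly isolated frames are memorized
    temporal_signal: float  # hypothetical: how strongly frame ordering is memorized

def predict_membership(scores: MIAScores,
                       weight: float = 0.5,
                       threshold: float = 0.7) -> bool:
    """Combine two memorization signals and threshold the result.

    Returns True if the candidate video is predicted to be a member of the
    text-to-video model's training set.
    """
    combined = weight * scores.sparse_signal + (1 - weight) * scores.temporal_signal
    return combined >= threshold

print(predict_membership(MIAScores(sparse_signal=0.9, temporal_signal=0.6)))
```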
arXiv Detail & Related papers (2026-01-16T11:35:52Z) - T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
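The input/model/output framing above suggests a defense that wraps a black-box generator. The following is a minimal, hypothetical sketch of such a model-agnostic pipeline; the concrete checks are placeholders passed in as callables, not T2VShield's actual components.

```python
# Schematic model-agnostic defense wrapper around a black-box T2V generator.
# The stage structure mirrors the input/model/output framing; every check is
# a placeholder supplied by the caller.
from typing import Callable, Optional

def defended_generate(prompt: str,
                      generate_video: Callable[[str], bytes],
                      is_unsafe_prompt: Callable[[str], bool],
                      rewrite_prompt: Callable[[str], str],
                      is_unsafe_video: Callable[[bytes], bool]) -> Optional[bytes]:
    # Input stage: screen the prompt and try one sanitizing rewrite.
    if is_unsafe_prompt(prompt):
        prompt = rewrite_prompt(prompt)
        if is_unsafe_prompt(prompt):
            return None  # refuse to generate
    # Model stage: call the underlying T2V model as a black box.
    video = generate_video(prompt)
    # Output stage: screen the generated video before returning it.
    return None if is_unsafe_video(video) else video
```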
arXiv Detail & Related papers (2025-04-22T01:18:42Z) - MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models [101.70140132374307]
Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models or focus only on limited perspectives such as fairness and privacy. We present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs.
arXiv Detail & Related papers (2025-03-19T01:59:44Z) - ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models [13.04745908368858]
We introduce ViBe: a large-scale dataset of hallucinated videos from open-source T2V models. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 MS captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings.
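As a rough illustration of "a classification framework using video embeddings", the sketch below trains a linear classifier on placeholder embeddings. The embedding source, labels, and classifier choice are assumptions, not the exact ViBe setup.

```python
# Sketch of the generic recipe "classify hallucination from video embeddings".
# Embeddings and labels are random stand-ins; in practice the embeddings would
# come from a pretrained video encoder and the labels from human annotation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # stand-in for video embeddings
labels = rng.integers(0, 2, size=500)      # 1 = hallucinated, 0 = faithful

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```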
arXiv Detail & Related papers (2024-11-16T19:23:12Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models [32.6243916760583]
We present a framework for measuring two core capabilities of video comprehension: appearance and motion understanding.
We introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos.
Our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L).
arXiv Detail & Related papers (2024-08-21T03:56:27Z) - T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models [39.15695612766001]
We introduce T2VSafetyBench, a new benchmark for safety-critical assessments of text-to-video models.
We define 12 critical aspects of video generation safety and construct a malicious prompt dataset.
No single model excels in all aspects, with different models showing various strengths.
There is a trade-off between the usability and safety of text-to-video generative models.
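To make the per-aspect bookkeeping concrete, here is a toy aggregation of unsafe-generation rates by aspect; the aspect names, prompts, and outcomes are invented for illustration and do not come from T2VSafetyBench.

```python
# Toy per-aspect aggregation in the spirit of a safety benchmark.
from collections import defaultdict

# (aspect, prompt_id, was_generation_unsafe) -- all values are fabricated.
results = [
    ("violence", 0, True), ("violence", 1, False),
    ("privacy", 0, False), ("privacy", 1, False),
    ("copyright", 0, True), ("copyright", 1, True),
]

unsafe, total = defaultdict(int), defaultdict(int)
for aspect, _, is_unsafe in results:
    total[aspect] += 1
    unsafe[aspect] += int(is_unsafe)

for aspect in total:
    print(f"{aspect}: unsafe rate = {unsafe[aspect] / total[aspect]:.0%}")
```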
arXiv Detail & Related papers (2024-07-08T14:04:58Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
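A minimal sketch of the general idea of reusing representations from a fixed backbone, using PyTorch forward hooks on a toy stand-in network; the actual VD-IT components and the T2V diffusion backbone are not reproduced here, and this is not the paper's implementation.

```python
# Extracting intermediate features from a frozen backbone via a forward hook.
# TinyVideoBackbone is a toy placeholder for a pretrained video model.
import torch
import torch.nn as nn

class TinyVideoBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Conv3d(3, 8, kernel_size=3, padding=1)
        self.block2 = nn.Conv3d(8, 16, kernel_size=3, padding=1)

    def forward(self, x):
        return self.block2(torch.relu(self.block1(x)))

backbone = TinyVideoBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # keep the backbone fixed

features = {}
backbone.block1.register_forward_hook(
    lambda module, inp, out: features.update(block1=out))

video = torch.randn(1, 3, 8, 64, 64)  # (batch, channels, frames, height, width)
with torch.no_grad():
    _ = backbone(video)
print(features["block1"].shape)  # intermediate features a task head could consume
```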
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)