VUDG: A Dataset for Video Understanding Domain Generalization
- URL: http://arxiv.org/abs/2505.24346v1
- Date: Fri, 30 May 2025 08:39:36 GMT
- Title: VUDG: A Dataset for Video Understanding Domain Generalization
- Authors: Ziyi Wang, Zhi Gao, Boxuan Yu, Zirui Dai, Yuxiang Song, Qingyuan Lu, Jin Chen, Xinxiao Wu,
- Abstract summary: Video Understanding Domain Generalization (VUDG) is an annotated dataset designed specifically for evaluating DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation.
- Score: 29.27464392754555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large video-language models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts. These results highlight the challenges posed by VUDG and the differences in the robustness of current models to data distribution shifts. We believe VUDG provides a valuable resource for promoting future research in domain-generalized video understanding.
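The evaluation the abstract describes, i.e. scoring a model's multiple-choice accuracy per domain and measuring how much it degrades under domain shift, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the domain names and predictions are invented for the example.

```python
# Sketch of per-domain accuracy and the generalization gap (drop from the
# best-scoring domain to the worst), as used to compare models under
# domain shift. Records are (domain, predicted_choice, gold_choice).

def domain_accuracies(records):
    """Return a dict mapping each domain to its multiple-choice accuracy."""
    totals, hits = {}, {}
    for domain, pred, gold in records:
        totals[domain] = totals.get(domain, 0) + 1
        hits[domain] = hits.get(domain, 0) + (pred == gold)
    return {d: hits[d] / totals[d] for d in totals}

def generalization_gap(acc_by_domain):
    """Accuracy drop from the strongest domain to the weakest one."""
    return max(acc_by_domain.values()) - min(acc_by_domain.values())

# Hypothetical predictions for two of the dataset's domains.
records = [
    ("cartoon", "A", "A"), ("cartoon", "B", "B"), ("cartoon", "C", "C"),
    ("egocentric", "A", "B"), ("egocentric", "B", "B"), ("egocentric", "C", "C"),
]
accs = domain_accuracies(records)
print(accs)                      # per-domain accuracy
print(generalization_gap(accs))  # how much performance degrades across domains
```

A model that is robust to distribution shift would show a small gap even when the per-domain accuracies are computed over all 11 domains.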
Related papers
- Consistent and Invariant Generalization Learning for Short-video Misinformation Detection [10.402862106017965]
Short-video misinformation detection has attracted wide attention in the multi-modal domain. Current models often exhibit unsatisfactory performance on unseen domains due to domain gaps. We propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection.
arXiv Detail & Related papers (2025-07-05T14:53:32Z) - ReferEverything: Towards Segmenting Everything We Can Speak of in Videos [42.88584315033116]
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our key insight is to preserve the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. REM performs on par with the state-of-the-art on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 12 IoU points out-of-domain.
arXiv Detail & Related papers (2024-10-30T17:59:26Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Diversifying Spatial-Temporal Perception for Video Domain Generalization [32.49202592793828]
Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain.
We propose to perceive diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific cues.
arXiv Detail & Related papers (2023-10-27T07:36:36Z) - Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z) - Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey [42.22801056661226]
Video analysis tasks such as action recognition have received increasing research interest with growing applications in fields such as smart healthcare.
Video models trained on existing datasets suffer from significant performance degradation when deployed directly to real-world applications.
Video unsupervised domain adaptation (VUDA) is introduced to adapt video models from the labeled source domain to the unlabeled target domain.
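The source-to-target adaptation this survey covers can be illustrated with the simplest VUDA recipe: confidence-thresholded pseudo-labeling of unlabeled target-domain clips by a source-trained model. This is an illustrative sketch of that generic recipe, not any specific method from the survey; the class names and probabilities are invented.

```python
# Sketch of confidence-thresholded pseudo-labeling for video UDA:
# a model trained on the labeled source domain scores unlabeled target
# clips, and only confident predictions are kept as training labels.

def pseudo_label(target_probs, threshold=0.9):
    """target_probs: list of per-clip class-probability dicts produced by
    a source-trained model. Returns (clip_index, label) pairs whose top
    class probability meets the threshold, for use in retraining."""
    kept = []
    for i, probs in enumerate(target_probs):
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            kept.append((i, label))
    return kept

# Hypothetical target-domain predictions for an action-recognition model.
probs = [{"run": 0.95, "walk": 0.05}, {"run": 0.55, "walk": 0.45}]
print(pseudo_label(probs))  # only the confident first clip is kept
```

Retraining on the kept pairs, then re-scoring and repeating, is the basic self-training loop on which many VUDA methods build.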
arXiv Detail & Related papers (2022-11-17T05:05:42Z) - Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z) - Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that better generalises to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations, are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z) - Generative Adversarial Networks for Video-to-Video Domain Adaptation [32.670977389990306]
We propose a novel generative adversarial network (GAN), namely VideoGAN, to transfer the video-based data across different domains.
As the frames of a video may have similar content and imaging conditions, the proposed VideoGAN has an X-shape generator to preserve the intra-video consistency.
Two colonoscopic datasets from different centres, i.e., CVC-Clinic and ETIS-Larib, are adopted to evaluate the performance of our VideoGAN.
arXiv Detail & Related papers (2020-04-17T04:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.