Diversifying Spatial-Temporal Perception for Video Domain Generalization
- URL: http://arxiv.org/abs/2310.17942v1
- Date: Fri, 27 Oct 2023 07:36:36 GMT
- Title: Diversifying Spatial-Temporal Perception for Video Domain Generalization
- Authors: Kun-Yu Lin, Jia-Run Du, Yipeng Gao, Jiaming Zhou, Wei-Shi Zheng
- Abstract summary: Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain.
We propose to perceive diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific cues.
- Score: 32.49202592793828
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Video domain generalization aims to learn generalizable video classification
models for unseen target domains by training in a source domain. A critical
challenge of video domain generalization is to defend against the heavy
reliance on domain-specific cues extracted from the source domain when
recognizing target videos. To this end, we propose to perceive diverse
spatial-temporal cues in videos, aiming to discover potential domain-invariant
cues in addition to domain-specific cues. We contribute a novel model named
Spatial-Temporal Diversification Network (STDN), which improves the diversity
from both space and time dimensions of video data. First, our STDN proposes to
discover various types of spatial cues within individual frames by spatial
grouping. Then, our STDN proposes to explicitly model spatial-temporal
dependencies between video contents at multiple space-time scales by
spatial-temporal relation modeling. Extensive experiments on three benchmarks
of different types demonstrate the effectiveness and versatility of our
approach.
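The abstract describes two mechanisms: soft-grouping spatial features within each frame into several cue types, and aggregating those grouped features over multiple temporal scales. The paper's actual architecture is not given here, so the following is only an illustrative numpy sketch under assumed names (`group_proto` as hypothetical learnable group prototypes, `scales` as hypothetical temporal window sizes), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_grouping(frame_feats, group_proto):
    """Softly assign spatial locations of each frame to K cue groups.

    frame_feats: (T, HW, C) per-frame spatial features
    group_proto: (K, C) hypothetical learnable group prototypes
    returns:     (T, K, C) per-frame grouped cue features
    """
    logits = frame_feats @ group_proto.T              # (T, HW, K) location-to-group affinity
    assign = softmax(logits, axis=-1)                 # soft assignment over groups
    groups = np.einsum('thk,thc->tkc', assign, frame_feats)
    # normalize each group by its total assignment mass
    return groups / assign.sum(axis=1, keepdims=True).transpose(0, 2, 1)

def multiscale_temporal_relations(groups, scales=(1, 2, 4)):
    """Summarize grouped cues over several temporal window sizes.

    groups: (T, K, C); returns (K, C * len(scales)) multi-scale summary.
    """
    T = groups.shape[0]
    feats = []
    for s in scales:
        # mean-pool every length-s window, then average window summaries
        pooled = np.stack([groups[t:t + s].mean(axis=0) for t in range(T - s + 1)])
        feats.append(pooled.mean(axis=0))             # (K, C) summary at this scale
    return np.concatenate(feats, axis=-1)
```

For example, with `T=8` frames, `HW=16` locations, `C=32` channels, and `K=4` groups, `spatial_grouping` yields a `(8, 4, 32)` tensor and `multiscale_temporal_relations` a `(4, 96)` summary; in a real model the grouping and relation steps would be learned jointly with the classifier rather than computed from fixed prototypes.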
Related papers
- VUDG: A Dataset for Video Understanding Domain Generalization [29.27464392754555]
Video Understanding Domain Generalization (VUDG) is an annotated dataset designed specifically for evaluating DG performance in video understanding. VUDG contains videos from 11 distinct domains covering three types of domain shifts, and maintains semantic similarity across domains to ensure fair and meaningful evaluation.
arXiv Detail & Related papers (2025-05-30T08:39:36Z) - Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance [4.782038032310931]
Single-domain generalization for object detection (S-DGOD) aims to transfer knowledge from a single source domain to unseen target domains.
Due to the inherent diversity across domains, an excessive emphasis on invariance can cause the model to overlook the actual differences between images.
arXiv Detail & Related papers (2025-02-06T07:41:24Z) - Exploiting Aggregation and Segregation of Representations for Domain Adaptive Human Pose Estimation [50.31351006532924]
Human pose estimation (HPE) has received increasing attention recently due to its wide application in motion analysis, virtual reality, healthcare, etc.
It suffers from the lack of labeled diverse real-world datasets due to the time- and labor-intensive annotation.
We introduce a novel framework that capitalizes on both representation aggregation and segregation for domain adaptive human pose estimation.
arXiv Detail & Related papers (2024-12-29T17:59:45Z) - Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z) - Unified Domain Adaptive Semantic Segmentation [96.74199626935294]
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled target domain.
We propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies.
Our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.
arXiv Detail & Related papers (2023-11-22T09:18:49Z) - Aggregation of Disentanglement: Reconsidering Domain Variations in Domain Generalization [9.577254317971933]
We argue that domain variations also contain useful information, i.e., classification-aware information, for downstream tasks.
We propose a novel paradigm called Domain Disentanglement Network (DDN) to disentangle the domain expert features from the source domain images.
We also propound a new contrastive learning method to guide the domain expert features to form a more balanced and separable feature space.
arXiv Detail & Related papers (2023-02-05T09:48:57Z) - Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors.
The TranSVAE framework is then developed to model such generation.
Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Domain Adaptive Video Segmentation via Temporal Consistency Regularization [32.77436219094282]
This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos by temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
arXiv Detail & Related papers (2021-07-23T02:50:42Z) - Structured Latent Embeddings for Recognizing Unseen Classes in Unseen Domains [108.11746235308046]
We propose a novel approach that learns domain-agnostic structured latent embeddings by projecting images from different domains.
Our experiments on the challenging DomainNet and DomainNet-LS benchmarks show the superiority of our approach over existing methods.
arXiv Detail & Related papers (2021-07-12T17:57:46Z) - Channel-wise Alignment for Adaptive Object Detection [66.76486843397267]
Generic object detection has been immensely promoted by the development of deep convolutional neural networks.
Existing methods on this task usually draw attention on the high-level alignment based on the whole image or object of interest.
In this paper, we realize adaptation from a thoroughly different perspective, i.e., channel-wise alignment.
arXiv Detail & Related papers (2020-09-07T02:42:18Z) - Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z) - Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit long-range spatial and temporal context interdependencies and spatial-temporal correlations of such features.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.