MMDG-Bench: A Benchmark for Multimodal Domain Generalization
Abstract Overview
This paper introduces MMDG-Bench, a benchmark for multimodal domain generalization that is designed to unify evaluation across both multimodal learning (MML) and domain generalization (DG). The benchmark is organized around two complementary integration orders: DG then MML (D2M), which aligns each modality across domains before fusion, and MML then DG (M2D), which fuses modalities within each domain before enforcing domain invariance. It establishes shared protocols on two task families that had not been jointly standardized in prior work: video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. Within this setup, the authors instantiate ten benchmark variants by combining a fixed multimodal configuration with five domain generalization methods under each of the two orderings, and use the resulting experiments to compare methods, backbones, and framework choices under consistent conditions.
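The difference between the two orderings can be illustrated with a minimal sketch. It is not the benchmark's implementation: per-domain mean-centering stands in for a DG method and an elementwise product stands in for multimodal fusion, and the names `center`, `align_across_domains`, `fuse`, `d2m`, `m2d`, and the toy data are all assumptions made for illustration.

```python
from statistics import fmean

def center(rows):
    """Subtract the column-wise mean from a list of feature vectors."""
    means = [fmean(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def align_across_domains(feats_by_domain):
    """Hypothetical stand-in for a DG method: centre each domain separately."""
    return {d: center(rows) for d, rows in feats_by_domain.items()}

def fuse(per_modality_rows):
    """Hypothetical stand-in for fusion: per-sample elementwise product across modalities."""
    fused = []
    for sample in zip(*per_modality_rows):  # one vector per modality
        out = [1.0] * len(sample[0])
        for vec in sample:
            out = [a * b for a, b in zip(out, vec)]
        fused.append(out)
    return fused

def d2m(data):
    """DG then MML: align each modality across domains, then fuse per domain."""
    modalities = list(next(iter(data.values())).keys())
    aligned = {
        m: align_across_domains({d: data[d][m] for d in data})
        for m in modalities
    }
    return {d: fuse([aligned[m][d] for m in modalities]) for d in data}

def m2d(data):
    """MML then DG: fuse modalities within each domain, then align the fused features."""
    fused = {d: fuse(list(mods.values())) for d, mods in data.items()}
    return align_across_domains(fused)

# Toy data: data[domain][modality] is a list of per-sample feature vectors.
data = {
    "domain_a": {"video": [[1.0, 2.0], [3.0, 4.0]], "audio": [[0.5, 1.0], [1.5, 3.0]]},
    "domain_b": {"video": [[10.0, 0.0], [12.0, 2.0]], "audio": [[2.0, 2.0], [4.0, 6.0]]},
}

d2m_features = d2m(data)
m2d_features = m2d(data)
```

Because the stand-in fusion is nonlinear, the two orderings produce different features from the same inputs, which is the property the D2M versus M2D comparison is meant to probe.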
Novelty
The main novelty is the creation of a standardized benchmark for multimodal domain generalization that goes beyond the field's prior concentration on action recognition and ad hoc evaluation. A second distinctive contribution is the explicit formulation and comparison of two framework orderings, D2M and M2D, together with an analysis linking their relative suitability to the cross-domain stability of cross-modal relationships.
Results
Across the reported experiments, the structured MMDG variants generally outperform reproduced multimodal DG baselines. For action recognition with CNN backbones, the best mean rises from 66.78 to 72.12 on HAC and from 64.80 to 66.98 on EPIC-Kitchens. For face anti-spoofing, the reproduced baselines reach a best mean of 21.92 HTER with 84.35 AUC, or 20.42 HTER with 83.83 AUC, while the best MMDG variant reaches 14.40 HTER and 90.05 AUC (lower HTER and higher AUC are better). The study also shows that M2D is stronger on HAC and face anti-spoofing, while D2M is more reliable on EPIC-Kitchens, and that explicit DG improves robustness under stronger backbones and in missing-modality testing.
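The size of these improvements follows from simple arithmetic on the reported means. The sketch below copies only the figures quoted above; the labels "baseline A" and "baseline B" for the two reproduced face anti-spoofing baselines and the helper name `absolute_gain` are assumptions for illustration. HTER is treated as an error rate, so its gain is measured as a reduction.

```python
def absolute_gain(baseline, best, higher_is_better=True):
    """Absolute improvement of `best` over `baseline`, rounded to two decimals."""
    diff = best - baseline if higher_is_better else baseline - best
    return round(diff, 2)

# (label, reproduced baseline, best MMDG variant, higher_is_better)
REPORTED = [
    ("HAC mean", 66.78, 72.12, True),
    ("EPIC-Kitchens mean", 64.80, 66.98, True),
    ("FAS HTER vs baseline A", 21.92, 14.40, False),
    ("FAS HTER vs baseline B", 20.42, 14.40, False),
    ("FAS AUC vs baseline A", 84.35, 90.05, True),
    ("FAS AUC vs baseline B", 83.83, 90.05, True),
]

gains = {
    label: absolute_gain(baseline, best, higher)
    for label, baseline, best, higher in REPORTED
}

for label, gain in gains.items():
    print(f"{label}: {gain:+.2f}")
```

On these figures, the gain on HAC (5.34) is more than twice the gain on EPIC-Kitchens (2.18), and the face anti-spoofing HTER drops by 6.02 to 7.52 depending on which reproduced baseline is used for comparison.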
Key Points
- MMDG-Bench standardizes multimodal domain generalization evaluation across two distinct tasks, using unified protocols and ten variants built from consistent MML and DG components.
- The benchmark highlights that adding explicit domain generalization methods to multimodal learning yields more reliable cross-domain performance than multimodal-only baselines, including under backbone changes.
- Framework choice is not universal: D2M works better when cross-modal relations are stable across domains, whereas M2D is more effective when those relations vary across domains. M2D is also more effective under missing-modality face anti-spoofing tests.