FuguReport

MMDG-Bench: A Benchmark for Multimodal Domain Generalization

Authors: Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu
Affiliations: Samsung AI Centre Cambridge / Jiyue AI / The University of Manchester / University of Surrey
Categories: Evaluation / Benchmarking / Multimodal domain generalization evaluation protocol; Task / Domain Generalization / Generalization to unseen domains via multimodality; Method / Multimodal Learning / Combining complementary modalities for robustness
License: CC BY 4.0

Abstract Overview

This paper introduces MMDG-Bench, a benchmark for multimodal domain generalization (MMDG) that unifies evaluation across multimodal learning (MML) and domain generalization (DG). The benchmark is organized around two complementary integration orders: DG then MML (D2M), which aligns each modality across domains before fusion, and MML then DG (M2D), which fuses modalities within each domain before enforcing domain invariance. It establishes shared protocols on two task families that prior work had not jointly standardized: video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. Within this setup, the authors instantiate ten benchmark variants by pairing a fixed multimodal configuration with five domain generalization methods under each of the two orders, and use the resulting experiments to compare methods, backbones, and framework choices under consistent conditions.
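The difference between the two integration orders can be made concrete with a toy sketch. This is not the authors' implementation: the per-domain mean centering used as the "alignment" step, the elementwise product used as "fusion", and all names (center, fuse, d2m, m2d, the domain and modality keys) are assumptions chosen only to show that swapping the order of the two steps changes the result.

```python
from statistics import fmean

def center(rows):
    """Toy alignment (assumption): subtract the per-dimension mean so that
    every domain ends up with zero-mean features."""
    dims = range(len(rows[0]))
    means = [fmean(r[d] for r in rows) for d in dims]
    return [[r[d] - means[d] for d in dims] for r in rows]

def fuse(video_rows, audio_rows):
    """Toy fusion (assumption): elementwise product of paired modality features."""
    return [[v * a for v, a in zip(vr, ar)]
            for vr, ar in zip(video_rows, audio_rows)]

def d2m(data):
    """DG then MML: align each modality across domains first, then fuse."""
    return {dom: fuse(center(m["video"]), center(m["audio"]))
            for dom, m in data.items()}

def m2d(data):
    """MML then DG: fuse within each domain first, then align the fused features."""
    return {dom: center(fuse(m["video"], m["audio"]))
            for dom, m in data.items()}

# Hypothetical features: two domains, two modalities, two samples, two dimensions.
data = {
    "domain_a": {"video": [[1.0, 2.0], [3.0, 4.0]],
                 "audio": [[1.0, 0.0], [2.0, 1.0]]},
    "domain_b": {"video": [[10.0, 0.0], [12.0, 2.0]],
                 "audio": [[0.5, 1.0], [1.5, 3.0]]},
}

d2m_out = d2m(data)
m2d_out = m2d(data)
print(d2m_out["domain_a"])  # [[0.5, 0.5], [0.5, 0.5]]
print(m2d_out["domain_a"])  # [[-2.5, -2.0], [2.5, 2.0]]
```

Because the toy fusion is nonlinear, the two orders produce different features from the same inputs. In M2D the invariance step sees cross-modal interactions, while in D2M it only ever sees one modality at a time.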

Novelty

The main novelty is the creation of a standardized benchmark for multimodal domain generalization that goes beyond the field's prior concentration on action recognition and ad hoc evaluation. A second distinctive contribution is the explicit formulation and comparison of two framework orderings, D2M and M2D, together with an analysis linking their relative suitability to the cross-domain stability of cross-modal relationships.

Results

Across the reported experiments, the structured MMDG variants generally outperform reproduced multimodal DG baselines. For action recognition with CNN backbones, the best mean rises from 66.78 to 72.12 on HAC and from 64.80 to 66.98 on EPIC-Kitchens. For face anti-spoofing, where lower HTER and higher AUC are better, the best mean performance improves from 21.92/84.35 or 20.42/83.83 (HTER/AUC) for the reproduced baselines to 14.40/90.05. The study also shows that M2D is stronger on HAC and face anti-spoofing, while D2M is more reliable on EPIC-Kitchens, and that explicit DG improves robustness under stronger backbones and missing-modality testing.
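For readers unfamiliar with the face anti-spoofing metrics, the sketch below computes HTER (the mean of the false acceptance and false rejection rates) and AUC on a handful of made-up scores. The label convention (1 for live, 0 for spoof), the fixed threshold of 0.5, and the function names are assumptions for illustration. The summary does not state how the benchmark selects its threshold.

```python
def split_by_label(scores, labels):
    """Separate scores into live (label 1) and spoof (label 0) groups."""
    live = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    return live, spoof

def hter(scores, labels, threshold):
    """Half Total Error Rate in percent: mean of FAR and FRR at a threshold."""
    live, spoof = split_by_label(scores, labels)
    far = sum(s >= threshold for s in spoof) / len(spoof)  # spoofs accepted
    frr = sum(s < threshold for s in live) / len(live)     # live faces rejected
    return 100.0 * (far + frr) / 2.0

def auc(scores, labels):
    """AUC in percent: share of (live, spoof) pairs ranked correctly,
    counting ties as half a correct pair."""
    live, spoof = split_by_label(scores, labels)
    wins = sum(1.0 if l > s else 0.5 if l == s else 0.0
               for l in live for s in spoof)
    return 100.0 * wins / (len(live) * len(spoof))

# Made-up liveness scores (higher means "more likely live").
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(round(hter(scores, labels, 0.5), 2))  # 33.33
print(round(auc(scores, labels), 2))        # 88.89
```

HTER depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two are typically reported together.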

Key Points

  1. MMDG-Bench standardizes multimodal domain generalization evaluation across two distinct tasks, using unified protocols and ten variants built from consistent MML and DG components.
  2. The benchmark highlights that adding explicit domain generalization methods to multimodal learning yields more reliable cross-domain performance than multimodal-only baselines, including under backbone changes.
  3. Framework choice is not universal: D2M works better when cross-modal relations are stable across domains, whereas M2D is more effective when those relations vary across domains, as well as in missing-modality face anti-spoofing tests.
