Reasoning-Driven Multimodal LLM for Domain Generalization
- URL: http://arxiv.org/abs/2602.23777v1
- Date: Fri, 27 Feb 2026 08:10:06 GMT
- Title: Reasoning-Driven Multimodal LLM for Domain Generalization
- Authors: Zhipeng Xu, Zilong Wang, Xinyang Jiang, Dongsheng Li, De Cheng, Nannan Wang
- Abstract summary: We study the role of reasoning in domain generalization using the DomainBed-Reasoning dataset. We propose RD-MLDG, a framework with two components: MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization). Experiments on standard DomainBed datasets demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a complementary signal for generalization.
- Score: 72.00754603114187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset, in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
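The MTCT idea described above, pairing reasoning-chain supervision with an additional direct classification pathway, can be sketched as a multi-task loss. This is an illustrative sketch only: the function name `mtct_loss`, the weighting factor `alpha`, and the tensor shapes are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def mtct_loss(reasoning_logits, reasoning_targets,
              class_logits, class_targets, alpha=0.5):
    """Combine token-level supervision over a reasoning chain with a
    direct classification loss, as a sketch of multi-task cross-training.

    reasoning_logits: (batch, seq_len, vocab) token predictions
    reasoning_targets: (batch, seq_len) target token ids (-100 = ignore)
    class_logits: (batch, num_classes) from the direct pathway
    class_targets: (batch,) ground-truth class ids
    alpha: assumed weight balancing the two objectives
    """
    # Token-level cross-entropy over the generated reasoning chain;
    # -100 masks out prompt/padding positions, as in standard LM training.
    l_reason = F.cross_entropy(
        reasoning_logits.reshape(-1, reasoning_logits.size(-1)),
        reasoning_targets.reshape(-1),
        ignore_index=-100,
    )
    # Direct label supervision from the auxiliary classification pathway,
    # which gives the optimizer an easier signal alongside the chain loss.
    l_cls = F.cross_entropy(class_logits, class_targets)
    return l_reason + alpha * l_cls
```

The direct pathway here plays the guiding role the abstract attributes to MTCT: even when the reasoning sequence is hard to optimize, the classification term still provides a clean gradient toward the correct label.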
Related papers
- Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts. We propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z) - Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization [52.52838658375592]
We propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance conceptual connectivity across domains. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision.
arXiv Detail & Related papers (2025-10-19T04:13:29Z) - AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization [43.86757207244911]
We propose a comprehensive framework that addresses these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision.
arXiv Detail & Related papers (2025-08-06T08:00:27Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering diverse domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Disentangling Masked Autoencoders for Unsupervised Domain Generalization [57.56744870106124]
Unsupervised domain generalization is fast gaining attention but is still far from well-studied.
Disentangled Masked Autoencoders (DisMAE) aims to discover disentangled representations that faithfully reveal intrinsic features.
DisMAE co-trains the asymmetric dual-branch architecture with semantic and lightweight variation encoders.
arXiv Detail & Related papers (2024-07-10T11:11:36Z) - Rethinking Multi-domain Generalization with A General Learning Objective [17.155829981870045]
Multi-domain generalization (mDG) universally aims to minimize the discrepancy between training and testing distributions. The existing mDG literature lacks a general learning-objective paradigm. We propose to leverage a $Y$-mapping to relax the constraint.
arXiv Detail & Related papers (2024-02-29T05:00:30Z) - Consistency Regularization for Domain Generalization with Logit Attribution Matching [14.98337914353095]
Domain generalization (DG) is about training models that generalize well under domain shift.
We consider a third, lesser-known setting where a training domain is endowed with a collection of pairs of examples that share the same semantic information.
We present a theory showing that consistency regularization is conducive to DG and propose a novel consistency regularization method called Logit Attribution Matching.
arXiv Detail & Related papers (2023-05-13T10:21:53Z) - Diversity Boosted Learning for Domain Generalization with Large Number of Domains [4.711430413139393]
We show that the Diversity bOosted twO-level saMplIng (DOMI) framework helps train robust models against spurious correlations from both the domain side and the object side. We demonstrate this on the rotated MNIST, rotated Fashion-MNIST, and iWildCam datasets.
arXiv Detail & Related papers (2022-07-28T02:58:17Z) - Compound Domain Generalization via Meta-Knowledge Encoding [55.22920476224671]
We introduce Style-induced Domain-specific Normalization (SDNorm) to re-normalize the multi-modal underlying distributions.
We harness the prototype representations, the centroids of classes, to perform relational modeling in the embedding space.
Experiments on four standard Domain Generalization benchmarks reveal that COMEN surpasses state-of-the-art performance without the need for domain supervision.
arXiv Detail & Related papers (2022-03-24T11:54:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.