Robust Domain Generalization for Multi-modal Object Recognition
- URL: http://arxiv.org/abs/2408.05831v1
- Date: Sun, 11 Aug 2024 17:13:21 GMT
- Title: Robust Domain Generalization for Multi-modal Object Recognition
- Authors: Yuxin Qiao, Keqin Li, Junhong Lin, Rong Wei, Chufeng Jiang, Yang Luo, Haoyu Yang,
- Abstract summary: In multi-label classification, machine learning encounters the challenge of domain generalization when handling tasks with differing distributions from the training data.
Recent advancements in vision-language pre-training leverage supervision from extensive visual-language pairs, enabling learning across diverse domains.
This paper proposes solutions by inferring the actual loss, broadening evaluations to larger vision-language backbones, and introducing Mixup-CLIPood.
- Score: 14.128747255526012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In multi-label classification, machine learning encounters the challenge of domain generalization when handling tasks with distributions differing from the training data. Existing approaches primarily focus on vision object recognition and neglect the integration of natural language. Recent advancements in vision-language pre-training leverage supervision from extensive visual-language pairs, enabling learning across diverse domains and enhancing recognition in multi-modal scenarios. However, these approaches face limitations in loss function utilization, generality across backbones, and class-aware visual fusion. This paper proposes solutions to these limitations by inferring the actual loss, broadening evaluations to larger vision-language backbones, and introducing Mixup-CLIPood, which incorporates a novel mix-up loss for enhanced class-aware visual fusion. Our method demonstrates superior performance in domain generalization across multiple datasets.
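The abstract names a mix-up loss but gives no formula. As a point of reference only, the following is a minimal sketch of standard mix-up (a convex combination of two examples and their labels, with the weight drawn from a Beta distribution) paired with a soft-label cross-entropy. It is not the Mixup-CLIPood loss itself. The function names, the default alpha=0.2, and the use of softmax cross-entropy are assumptions made for illustration.

```python
import math
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha); alpha=0.2 is an assumed default."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two feature vectors and their label vectors with weight lam."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy between softmax(logits) and a soft (mixed) label vector."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, soft_targets))


# Hypothetical usage: mix two one-hot-labelled examples with a fixed weight.
x_mix, y_mix = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
loss = soft_cross_entropy([0.0, 0.0], y_mix)
```

In training, lam would normally be drawn per batch with sample_lambda rather than fixed, and the logits would come from the model's image-text similarity scores.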
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Transitive Vision-Language Prompt Learning for Domain Generalization [41.484858946789664]
Vision-language pre-training has enabled deep models to make a huge step forward in generalizing across unseen domains.
However, these advances still suffer from a trade-off between domain invariance and class separability.
arXiv Detail & Related papers (2024-04-29T14:56:11Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation [25.499205902426716]
We introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation.
We craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components.
Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information.
arXiv Detail & Related papers (2024-03-11T17:33:12Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multi-Scale and Multi-Layer Contrastive Learning for Domain Generalization [5.124256074746721]
We argue that the generalization ability of deep convolutional neural networks can be improved by taking advantage of the network's multi-layer and multi-scale representations.
We introduce a framework that aims at improving domain generalization of image classifiers by combining both low-level and high-level features at multiple scales.
We show that our model is able to surpass the performance of previous DG methods and consistently produce competitive and state-of-the-art results in all datasets.
arXiv Detail & Related papers (2023-08-28T08:54:27Z)
- OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network [17.980765138522322]
This work introduces OmDet, a novel language-aware object detection architecture.
Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets.
We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding.
arXiv Detail & Related papers (2022-09-10T14:25:14Z)
- INDIGO: Intrinsic Multimodality for Domain Generalization [26.344372409315177]
We study how multimodal information can be leveraged in an "intrinsic" way to make systems generalize under unseen domains.
We propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO).
arXiv Detail & Related papers (2022-06-13T05:41:09Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Deep Partial Multi-View Learning [94.39367390062831]
We propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets).
We first provide a formal definition of completeness and versatility for multi-view representation.
We then theoretically prove the versatility of the learned latent representations.
arXiv Detail & Related papers (2020-11-12T02:29:29Z)