DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
- URL: http://arxiv.org/abs/2504.01386v1
- Date: Wed, 02 Apr 2025 05:56:57 GMT
- Title: DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
- Authors: Junjie Wu, Jiangtao Xie, Zhaolin Zhang, Qilong Wang, Qinghua Hu, Peihua Li, Sen Xu
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data. We propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data.
- Score: 42.87396382273607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully account for the characteristics of domain-specific data (e.g., the fine-grained nature of biological data), which limits model capability while largely sacrificing CLIP's original ability in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the feature distributions of image-text pairs instead of the original [cls] tokens, which captures rich yet effective information inherent in image-text pairs as powerful representations and so better copes with the fine-grained nature of biological data. In particular, DALIP efficiently approximates each feature distribution by its first- and second-order statistics, and presents a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (a specific case of the biological domain) comprising 10M plant samples mixed with 3M general-domain samples (namely PlantMix-13M), with the mixture following data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain, while generalizing well to remote sensing and medical imaging domains. Besides, our PlantMix-13M dataset further boosts the performance of DALIP in the plant domain, while preserving model ability in the general domain.
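To make the abstract's distribution-alignment idea concrete, below is a minimal PyTorch sketch of one plausible instantiation: each image or text is summarized by the mean (first-order) and a multi-head Brownian-distance-covariance-style matrix (second-order) of its token features, and image-text similarity compares both statistics rather than only the [cls] token. The head count, the DeepBDC-style channel-wise formulation, and the weighted cosine combination are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of distribution matching over token features (not official DALIP code).
import torch
import torch.nn.functional as F


def bdc_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Channel-wise Brownian distance covariance matrix (DeepBDC-style sketch).

    tokens: (B, N, D) token features; returns (B, D, D).
    """
    x = tokens.transpose(1, 2)                                   # (B, D, N)
    # Euclidean distances between feature channels, using the N tokens
    # of each sample as observations of every channel.
    sq = (x.unsqueeze(1) - x.unsqueeze(2)).pow(2).sum(-1)        # (B, D, D)
    a = sq.clamp_min(1e-12).sqrt()
    # Double-centering: subtract row/column means, add back the grand mean.
    row = a.mean(dim=2, keepdim=True)
    col = a.mean(dim=1, keepdim=True)
    grand = a.mean(dim=(1, 2), keepdim=True)
    return a - row - col + grand


def distribution_stats(tokens: torch.Tensor, num_heads: int = 4):
    """First- and second-order statistics of token features.

    Splitting D into heads (MBDC-style assumption) keeps the second-order
    statistic at H x (D/H) x (D/H) instead of a full D x D matrix, which is
    the efficiency argument sketched in the abstract.
    """
    b, n, d = tokens.shape
    assert d % num_heads == 0, "feature dim must be divisible by num_heads"
    mean = tokens.mean(dim=1)                                    # (B, D)
    heads = tokens.reshape(b, n, num_heads, d // num_heads).permute(0, 2, 1, 3)
    second = bdc_matrix(heads.reshape(b * num_heads, n, d // num_heads))
    return mean, second.reshape(b, num_heads, d // num_heads, d // num_heads)


def distribution_similarity(img_tokens, txt_tokens, alpha: float = 0.5):
    """Similarity of matched image-text pairs from both statistics (illustrative weighting)."""
    mu_i, cov_i = distribution_stats(img_tokens)
    mu_t, cov_t = distribution_stats(txt_tokens)
    s1 = F.cosine_similarity(mu_i, mu_t, dim=-1)                 # first-order term
    s2 = F.cosine_similarity(cov_i.flatten(1), cov_t.flatten(1), dim=-1)
    return alpha * s1 + (1.0 - alpha) * s2                       # (B,) per-pair scores
```

In a CLIP-style objective, the per-pair scores above would be expanded to the full image-text score matrix and fed to a contrastive loss; the abstract leaves the exact loss form open, so this sketch only illustrates the statistic-matching step.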
Related papers
- Similarity-Based Domain Adaptation with LLMs [13.692329347889212]
Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize to unlabeled target data. This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation. Our framework achieves impressive performance; specifically, a 2.44% accuracy improvement over the SOTA method.
arXiv Detail & Related papers (2025-03-07T09:51:07Z)
- DataMan: Data Manager for Pre-training Large Language Models [39.677609311769146]
Existing methods rely on limited intuition, lacking comprehensive and clear guidelines.
We derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing.
Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model.
arXiv Detail & Related papers (2025-02-26T18:01:19Z)
- Domain Specific Data Distillation and Multi-modal Embedding Generation [0.0]
The challenge of creating domain-centric embeddings arises from the abundance of unstructured data and the scarcity of domain-specific structured data.
This paper introduces a novel modeling approach that leverages structured data to filter noise from unstructured data, resulting in embeddings with high precision and recall for domain-specific attribute prediction.
arXiv Detail & Related papers (2024-10-27T03:47:46Z)
- Precision at Scale: Domain-Specific Datasets On-Demand [3.5900418884504095]
Precision at Scale (PaS) is a novel method for the autonomous creation of domain-specific datasets on-demand.
The PaS pipeline leverages state-of-the-art foundation and generative models to create a collection of images belonging to any given domain.
We prove that automatically generated domain-specific datasets lead to better pretraining than large-scale supervised datasets such as ImageNet-1k and ImageNet-21k.
arXiv Detail & Related papers (2024-07-03T19:17:42Z)
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representations.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- DG-TTA: Out-of-domain Medical Image Segmentation through Augmentation and Descriptor-driven Domain Generalization and Test-Time Adaptation [43.842694540544194]
Applying pretrained medical deep learning segmentation models on out-of-domain images often yields predictions of insufficient quality. In this study, we propose to use a powerful generalizing descriptor along with augmentation to enable domain-generalized pretraining and test-time adaptation.
arXiv Detail & Related papers (2023-12-11T10:26:21Z)
- Federated and Generalized Person Re-identification through Domain and Feature Hallucinating [88.77196261300699]
We study the problem of federated domain generalization (FedDG) for person re-identification (re-ID).
We propose a novel method, called "Domain and Feature Hallucinating (DFH)", to produce diverse features for learning generalized local and global models.
Our method achieves the state-of-the-art performance for FedDG on four large-scale re-ID benchmarks.
arXiv Detail & Related papers (2022-03-05T09:15:13Z)
- TAL: Two-stream Adaptive Learning for Generalizable Person Re-identification [115.31432027711202]
We argue that both domain-specific and domain-invariant features are crucial for improving the generalization ability of re-id models.
We propose two-stream adaptive learning (TAL) to simultaneously model these two kinds of information.
Our framework can be applied to both single-source and multi-source domain generalization tasks.
arXiv Detail & Related papers (2021-11-29T01:27:42Z)
- Batch Normalization Embeddings for Deep Domain Generalization [50.51405390150066]
Domain generalization aims at training machine learning models to perform robustly across different and unseen domains.
We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks.
arXiv Detail & Related papers (2020-11-25T12:02:57Z)
- Multi-source Domain Adaptation for Visual Sentiment Classification [92.53780541232773]
We propose a novel multi-source domain adaptation (MDA) method, termed Multi-source Sentiment Generative Adversarial Network (MSGAN).
To handle data from multiple source domains, MSGAN learns to find a unified sentiment latent space where data from both the source and target domains share a similar distribution.
Extensive experiments conducted on four benchmark datasets demonstrate that MSGAN significantly outperforms the state-of-the-art MDA approaches for visual sentiment classification.
arXiv Detail & Related papers (2020-01-12T08:37:42Z)