Modest-Align: Data-Efficient Alignment for Vision-Language Models
- URL: http://arxiv.org/abs/2510.21606v1
- Date: Fri, 24 Oct 2025 16:11:10 GMT
- Title: Modest-Align: Data-Efficient Alignment for Vision-Language Models
- Authors: Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Mingkun Xu, Zuozhu Liu
- Abstract summary: Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings. We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
- Score: 67.48633659305592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
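As a rough illustration of the two mechanisms named in the abstract, the sketch below injects Gaussian noise into the embeddings (Random Perturbation) and softens the one-hot contrastive targets over the similarity matrix (Embedding Smoothing). The function name, the noise model, and the smoothing scheme are assumptions for illustration only, not the paper's actual implementation.

```python
import numpy as np

def modest_align_loss(img_emb, txt_emb, tau=0.07, noise_std=0.01,
                      smooth=0.1, rng=None):
    """Hypothetical sketch of a smoothed contrastive alignment loss."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random Perturbation: controlled noise to simulate uncertainty
    img_emb = img_emb + noise_std * rng.standard_normal(img_emb.shape)
    txt_emb = txt_emb + noise_std * rng.standard_normal(txt_emb.shape)
    # L2-normalize and compute the temperature-scaled similarity matrix
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / tau
    # Embedding Smoothing: soften the one-hot targets so the model is not
    # pushed to full confidence on possibly weakly aligned pairs
    n = logits.shape[0]
    targets = np.full((n, n), smooth / (n - 1))
    np.fill_diagonal(targets, 1.0 - smooth)

    def xent(lg):
        # Numerically stable cross-entropy against the smoothed targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -(targets * logp).sum(axis=1).mean()

    # Symmetric image-to-text and text-to-image loss, as in CLIP-style training
    return 0.5 * (xent(logits) + xent(logits.T))
```

With `smooth=0` this reduces to a standard symmetric InfoNCE loss; the smoothing term caps the probability mass the model can place on any single pair, which is one plausible way to "calibrate similarity distributions" as the abstract describes.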
Related papers
- CTTVAE: Latent Space Structuring for Conditional Tabular Data Generation on Imbalanced Datasets [0.0]
We introduce CTTVAE, a Conditional Transformer-based Tabular Variational Autoencoder equipped with two complementary mechanisms. CTTVAE+TBS consistently yields more representative and utility-aligned samples without destabilizing training.
arXiv Detail & Related papers (2026-02-03T15:25:26Z)
- Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data [67.25796812343454]
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights. Experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
arXiv Detail & Related papers (2025-10-09T13:05:27Z)
- MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning [14.06705718861471]
Multimodal models often over-rely on dominant modalities, failing to achieve optimal performance. We propose MIDAS, a novel data augmentation strategy that generates misaligned samples with semantically inconsistent cross-modal information. Experiments on multiple multimodal classification benchmarks demonstrate that MIDAS significantly outperforms related baselines in addressing modality imbalance.
arXiv Detail & Related papers (2025-09-30T06:13:17Z)
- Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities [15.205192581534973]
Multimodal sentiment analysis aims to understand human sentiment through multimodal data. Existing methods for handling modality missingness are based on data reconstruction or common subspace projections. We propose a Confidence-Aware Self-Distillation (CASD) strategy that effectively incorporates multimodal probabilistic embeddings.
arXiv Detail & Related papers (2025-06-02T09:48:41Z)
- GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval [13.928213494843744]
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data. We introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions.
arXiv Detail & Related papers (2025-05-19T16:25:55Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to build in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Modeling Multimodal Aleatoric Uncertainty in Segmentation with Mixture of Stochastic Experts [24.216869988183092]
We focus on capturing the data-inherent uncertainty (aka aleatoric uncertainty) in segmentation, typically when ambiguities exist in input images.
We propose a novel mixture of stochastic experts (MoSE) model, where each expert network estimates a distinct mode of aleatoric uncertainty.
We develop a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE and ground truth annotations.
arXiv Detail & Related papers (2022-12-14T16:48:21Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.