Geodesic Multi-Modal Mixup for Robust Fine-Tuning
- URL: http://arxiv.org/abs/2203.03897v4
- Date: Tue, 7 Nov 2023 00:34:37 GMT
- Title: Geodesic Multi-Modal Mixup for Robust Fine-Tuning
- Authors: Changdae Oh, Junhyuk So, Hoyoon Byun, YongTaek Lim, Minchul Shin,
Jong-June Jeon, Kyungwoo Song
- Abstract summary: We show that CLIP retains poor uniformity and alignment even after fine-tuning.
We propose a Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to generate hard negative samples.
Our method provides transferable representations, enabling robust model adaptation on diverse tasks.
- Score: 21.298732743643168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained multi-modal models, such as CLIP, provide transferable embeddings
and show promising results in diverse applications. However, the analysis of
learned multi-modal embeddings is relatively unexplored, and the embedding
transferability can be improved. In this work, we observe that CLIP holds
separated embedding subspaces for two different modalities, and then we
investigate it through the lens of uniformity-alignment to measure the quality
of learned representation. Both theoretically and empirically, we show that
CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack
of alignment and uniformity might restrict the transferability and robustness
of embeddings. To this end, we devise a new fine-tuning method for robust
representation equipping better alignment and uniformity. First, we propose a
Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to
generate hard negative samples on the hypersphere. Then, we fine-tune the model
on hard negatives as well as original negatives and positives with contrastive
loss. Based on the theoretical analysis about hardness guarantee and limiting
behavior, we justify the use of our method. Extensive experiments on retrieval,
calibration, few- or zero-shot classification (under distribution shift),
embedding arithmetic, and image captioning further show that our method
provides transferable representations, enabling robust model adaptation on
diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup
Related papers
- Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings.
We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency.
Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
- OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching [14.664226708184676]
Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes.
We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware.
arXiv Detail & Related papers (2025-10-10T07:07:19Z)
- CM$^3$: Calibrating Multimodal Recommendation [10.09576389984858]
This study revisits the alignment and uniformity properties within the context of multimodal recommender systems.
We propose a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold.
We also introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold.
arXiv Detail & Related papers (2025-08-02T06:44:59Z)
- Continual Multimodal Contrastive Learning [70.60542106731813]
Multimodal contrastive learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space.
However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive.
In this paper, we formulate continual multimodal contrastive learning (CMCL) through two specialized principles of stability and plasticity.
We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge.
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- Conditional Distribution Quantization in Machine Learning [83.54039134248231]
Conditional expectation $\mathbb{E}(Y \mid X)$ often fails to capture the complexity of multimodal conditional distributions $\mathcal{L}(Y \mid X)$.
We propose using n-point conditional quantizations, functional mappings of X that are learnable via gradient descent, to approximate $\mathcal{L}(Y \mid X)$.
arXiv Detail & Related papers (2025-02-11T00:28:24Z)
- Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models [85.67870425656368]
We introduce a unified causal model specifically designed for multimodal data.
We show that multimodal contrastive representation learning excels at identifying latent coupled variables.
Experiments demonstrate the robustness of our findings, even when the assumptions are violated.
arXiv Detail & Related papers (2024-02-09T07:18:06Z)
- Multi-scale Diffusion Denoised Smoothing [79.95360025953931]
Randomized smoothing has become one of the few tangible approaches that offer adversarial robustness to models at scale.
We present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing.
Our experiments show that the proposed multi-scale smoothing scheme, combined with diffusion fine-tuning, enables strong certified robustness even at high noise levels.
arXiv Detail & Related papers (2023-10-25T17:11:21Z)
- Multi-Head Multi-Loss Model Calibration [13.841172927454204]
We introduce a form of simplified ensembling that bypasses the costly training and inference of deep ensembles.
Specifically, each head is trained to minimize a weighted Cross-Entropy loss, but the weights are different among the different branches.
We show that the resulting averaged predictions can achieve excellent calibration without sacrificing accuracy on two challenging datasets.
arXiv Detail & Related papers (2023-03-02T09:32:32Z)
- Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation.
It incorporates the inductive bias of CNNs and powerful sequence modeling of auto-regression.
Our method achieves superior diverse image generation performance as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z)
- Modulated Contrast for Versatile Image Synthesis [60.304183493234376]
MoNCE is a versatile metric that introduces image contrast to learn a calibrated metric for the perception of multifaceted inter-image distances.
We introduce optimal transport in MoNCE to modulate the pushing force of negative samples collaboratively across multiple contrastive objectives.
arXiv Detail & Related papers (2022-03-17T14:03:46Z)
- Robustness via Uncertainty-aware Cycle Consistency [44.34422859532988]
Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs.
Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty.
We propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC).
arXiv Detail & Related papers (2021-10-24T15:33:21Z)
- TRS: Transferability Reduced Ensemble via Encouraging Gradient Diversity and Model Smoothness [14.342349428248887]
Adversarial Transferability is an intriguing property of adversarial examples.
This paper theoretically analyzes sufficient conditions for transferability between models.
We propose a practical algorithm to reduce transferability within an ensemble to improve its robustness.
arXiv Detail & Related papers (2021-04-01T17:58:35Z)
- Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
- The Bures Metric for Generative Adversarial Networks [10.69910379275607]
Generative Adversarial Networks (GANs) are performant generative methods yielding high-quality samples.
We propose to match the real batch diversity to the fake batch diversity.
We observe that diversity matching reduces mode collapse substantially and has a positive effect on the sample quality.
arXiv Detail & Related papers (2020-06-16T12:04:41Z)
- Embedding Propagation: Smoother Manifold for Few-Shot Classification [131.81692677836202]
We propose to use embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing in few-shot classification.
We empirically show that embedding propagation yields a smoother embedding manifold.
We show that embedding propagation consistently improves the accuracy of the models in multiple semi-supervised learning scenarios by up to 16 percentage points.
arXiv Detail & Related papers (2020-03-09T13:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.