Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards
- URL: http://arxiv.org/abs/2405.13459v3
- Date: Mon, 17 Feb 2025 06:46:11 GMT
- Title: Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards
- Authors: Xiaoyu Yang, Jie Lu, En Yu
- Abstract summary: Multi-modal Large Language Models (MLLMs) frequently face challenges from concept drift when dealing with real-world streaming data. We propose a unified framework that extends concept drift theory to the multi-modal domain. A T-distribution based drift adapter is proposed to effectively mitigate the bias induced by the gradual drift.
- Score: 16.97188816362991
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) frequently face challenges from concept drift when dealing with real-world streaming data, wherein distributions change unpredictably. This mainly includes gradual drift due to long-tailed data and sudden drift from Out-Of-Distribution (OOD) data, both of which have increasingly drawn the attention of the research community. While these issues have been extensively studied in the individual domain of vision or language, their impacts on MLLMs in concept drift settings remain largely underexplored. In this paper, we reveal the susceptibility and vulnerability of Vision-Language (VL) models to significant biases arising from gradual drift and sudden drift, particularly in the pre-training. To effectively address these challenges, we propose a unified framework that extends concept drift theory to the multi-modal domain, enhancing the adaptability of the VL model to unpredictable distribution changes. Additionally, a T-distribution based drift adapter is proposed to effectively mitigate the bias induced by the gradual drift, which also facilitates the model in distinguishing sudden distribution changes through explicit distribution modeling. Extensive experiments demonstrate our method enhances the efficiency and accuracy of image-text alignment in the pre-training of VL models, particularly in the concept drift scenario. Moreover, various downstream tasks exhibit significant improvements in our model's ability to adapt to the long-tailed open world. Furthermore, we create a set of multi-modal datasets called OpenMMlo, specifically tailored for the long-tailed open-world setting, to validate our findings. To foster the development of the multi-modal community, we have made both OpenMMlo datasets and our code publicly available at: https://github.com/XiaoyuYoung/ConceptDriftMLLMs.
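To make the T-distribution based drift adapter described in the abstract more concrete, the following is a minimal, hedged sketch rather than the authors' released code (their implementation is at the GitHub link above). It assumes text-derived class prototypes, an isotropic multivariate Student's t density with a fixed degrees-of-freedom parameter `nu`, and a hypothetical log-likelihood threshold `ood_threshold` for flagging sudden (OOD) drift; all of these names and values are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch (not the authors' code): a Student's t based drift adapter for
# image-text alignment. Heavy-tailed likelihoods soften the pull of head classes
# on tail samples (gradual drift), and low likelihood under every class prototype
# is used as a simple signal for sudden (OOD) drift.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class TDistributionDriftAdapter(nn.Module):
    def __init__(self, prototypes: torch.Tensor, nu: float = 5.0, ood_threshold: float = -50.0):
        super().__init__()
        # prototypes: (num_classes, dim) text-derived class centers (assumed given)
        self.register_buffer("prototypes", F.normalize(prototypes, dim=-1))
        self.nu = nu                      # degrees of freedom; heavier tails than a Gaussian
        self.ood_threshold = ood_threshold  # hypothetical cutoff for sudden-drift flagging

    def t_log_prob(self, x: torch.Tensor) -> torch.Tensor:
        """Log density of an isotropic multivariate Student's t around each prototype.
        Returns a (batch, num_classes) matrix of log-likelihoods."""
        d = x.shape[-1]
        diff = x.unsqueeze(1) - self.prototypes.unsqueeze(0)        # (B, C, d)
        sq_dist = diff.pow(2).sum(-1)                               # (B, C)
        const = (torch.lgamma(torch.tensor((self.nu + d) / 2.0))
                 - torch.lgamma(torch.tensor(self.nu / 2.0))
                 - 0.5 * d * math.log(self.nu * math.pi))
        return const - 0.5 * (self.nu + d) * torch.log1p(sq_dist / self.nu)

    def forward(self, image_emb: torch.Tensor):
        image_emb = F.normalize(image_emb, dim=-1)
        log_probs = self.t_log_prob(image_emb)                      # (B, C)
        # Gradual drift: t-based logits used directly for image-text alignment.
        logits = log_probs
        # Sudden drift: samples poorly explained by every prototype are flagged as OOD.
        is_ood = log_probs.max(dim=-1).values < self.ood_threshold
        return logits, is_ood


# Usage sketch: align image embeddings to text prototypes with a cross-entropy
# loss over the t-based logits, skipping samples flagged as sudden drift.
if __name__ == "__main__":
    protos = torch.randn(10, 512)             # hypothetical text prototypes
    adapter = TDistributionDriftAdapter(protos)
    imgs = torch.randn(4, 512)                # hypothetical image embeddings
    labels = torch.tensor([0, 3, 7, 9])
    logits, is_ood = adapter(imgs)
    in_dist = ~is_ood
    if in_dist.any():
        loss = F.cross_entropy(logits[in_dist], labels[in_dist])
        print(loss.item())
    print(is_ood.tolist())
```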
Related papers
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z) - Lyapunov-Stable Adaptive Control for Multimodal Concept Drift [1.4864895279988264]
This paper introduces LS-OGD, a novel adaptive control framework for robust multimodal learning in the presence of concept drift. Under bounded drift conditions, the LS-OGD system's prediction error is uniformly ultimately bounded and converges to zero if the drift ceases.
arXiv Detail & Related papers (2025-10-09T18:55:26Z) - Consistent World Models via Foresight Diffusion [56.45012929930605]
We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability. We propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising.
arXiv Detail & Related papers (2025-05-22T10:01:59Z) - Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [54.196385799229006]
This survey provides the first comprehensive review of advances from traditional approaches to foundation models. It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z) - Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks.
Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z) - Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z) - Evolving Multi-Scale Normalization for Time Series Forecasting under Distribution Shifts [20.02869280775877]
We propose a novel model-agnostic Evolving Multi-Scale Normalization (EvoMSN) framework to tackle the distribution shift problem.
We evaluate the effectiveness of EvoMSN in improving the performance of five mainstream forecasting methods on benchmark datasets.
arXiv Detail & Related papers (2024-09-29T14:26:22Z) - FFHFlow: A Flow-based Variational Approach for Multi-fingered Grasp Synthesis in Real Time [19.308304984645684]
We propose exploiting a special kind of Deep Generative Model (DGM) based on Normalizing Flows (NFs).
We first observed an encouraging improvement in diversity by directly applying a single conditional NF (cNF) to learn a grasp distribution conditioned on the incomplete point cloud.
This motivated us to develop a novel flow-based Deep Latent Variable Model (DLVM).
Unlike Variational Autoencoders (VAEs), the proposed DLVM counteracts typical pitfalls by leveraging two cNFs for the prior and likelihood distributions.
arXiv Detail & Related papers (2024-07-21T13:33:08Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Balanced Multi-modal Federated Learning via Cross-Modal Infiltration [19.513099949266156]
Federated learning (FL) underpins advancements in privacy-preserving distributed computing.
We propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework.
arXiv Detail & Related papers (2023-12-31T05:50:15Z) - MemDA: Forecasting Urban Time Series with Memory-based Drift Adaptation [24.284969264008733]
We propose a new urban time series prediction model for the concept drift problem, which encodes the drift by considering the periodicity in the data.
Our design significantly outperforms state-of-the-art methods and can be well generalized to existing prediction backbones.
arXiv Detail & Related papers (2023-09-25T15:22:28Z) - Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset (a minimal reference sketch of standard focal loss follows this list).
arXiv Detail & Related papers (2023-08-10T08:43:20Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Distributional Drift Adaptation with Temporal Conditional Variational Autoencoder for Multivariate Time Series Forecasting [41.206310481507565]
We propose a novel framework temporal conditional variational autoencoder (TCVAE) to model the dynamic distributional dependencies over time.
The TCVAE infers the dependencies as a temporal conditional distribution to leverage latent variables.
We show the TCVAE's superior robustness and effectiveness over the state-of-the-art MTS forecasting baselines.
arXiv Detail & Related papers (2022-09-01T10:06:22Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
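The Ensemble Modeling for Multimodal Visual Action Recognition entry above mentions a focal-loss variant tailored to long-tailed data. As a reference point only, here is a minimal standard multi-class focal loss in PyTorch; the authors' tailored variant is not specified in the abstract summary, so this sketch should not be read as their method.

```python
# Hedged sketch: standard multi-class focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t).
# Included as a reference for the "focal loss variant" mentioned above, not as the
# authors' exact formulation.
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t per sample
    pt = log_pt.exp()
    # Down-weights easy (high-confidence) examples so rare tail classes
    # contribute relatively more to the gradient.
    return (-(1.0 - pt).pow(gamma) * log_pt).mean()


# Usage example with random logits and targets.
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(focal_loss(logits, targets).item())
```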
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.