WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
- URL: http://arxiv.org/abs/2512.02536v1
- Date: Tue, 02 Dec 2025 09:02:20 GMT
- Title: WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
- Authors: Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha
- Abstract summary: We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
- Score: 69.97021957331326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
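The abstract describes the bridge only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: learnable query tokens perturbed with Gaussian noise, a stand-in Transformer trunk in place of the pre-trained VLM, and a VAE branch whose latents are linearly projected and concatenated with the query conditioning. All names, dimensions, and the specific noise-injection scheme (NoisyQueryBridge, noise_std, vae_proj, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the bridging idea described in the abstract, NOT the released code.
# Assumptions: "noisy" queries = learnable embeddings + Gaussian noise at training time;
# the VLM trunk is approximated by a small Transformer encoder; the VAE branch is a
# single linear projection of flattened VAE latents concatenated with the query outputs.
import torch
import torch.nn as nn


class NoisyQueryBridge(nn.Module):
    def __init__(self, num_queries=64, vlm_dim=1024, cond_dim=768,
                 vae_latent_dim=4 * 32 * 32, noise_std=0.1):
        super().__init__()
        # Learnable query tokens; noise spreads them over a distributed region of the
        # VLM/diffusion interface space instead of a fixed set of points.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        self.noise_std = noise_std
        # Stand-in for the pre-trained VLM trunk, optimized end to end in this sketch.
        self.vlm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vlm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projects query outputs into the diffusion model's conditioning space.
        self.to_cond = nn.Linear(vlm_dim, cond_dim)
        # VAE branch: linear projection of VAE latents to recover fine-grained detail.
        self.vae_proj = nn.Linear(vae_latent_dim, cond_dim)

    def forward(self, image_tokens, vae_latents):
        b = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        if self.training:
            q = q + self.noise_std * torch.randn_like(q)  # noisy query tokens
        # Queries are processed jointly with the image tokens; only the query
        # positions are kept as the bridge output.
        x = self.vlm(torch.cat([q, image_tokens], dim=1))[:, : q.size(1)]
        cond = self.to_cond(x)                                   # (b, num_queries, cond_dim)
        vae_cond = self.vae_proj(vae_latents.flatten(1)).unsqueeze(1)
        return torch.cat([cond, vae_cond], dim=1)                # conditioning for the diffusion model


# Shape check with dummy tensors; real inputs would come from the VLM's vision encoder and a VAE.
bridge = NoisyQueryBridge()
cond = bridge(torch.randn(2, 256, 1024), torch.randn(2, 4, 32, 32))
print(cond.shape)  # torch.Size([2, 65, 768])
```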
Related papers
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - A Survey on Diffusion Language Models [30.00199970146068]
Diffusion Language Models (DLMs) are an alternative to the dominant autoregressive (AR) paradigm. DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context. Recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts.
arXiv Detail & Related papers (2025-08-14T17:47:22Z) - CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning [24.981279071712173]
We introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. Our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks.
arXiv Detail & Related papers (2025-03-25T17:57:17Z) - Enhanced Continual Learning of Vision-Language Models with Model Fusion [16.764069327701186]
Vision-Language Models (VLMs) represent a breakthrough in artificial intelligence. VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. We propose Continual Decoupling-Unifying (ConDU), a novel approach that introduces model fusion into continual learning.
arXiv Detail & Related papers (2025-03-12T15:48:13Z) - LFTR: Learning-Free Token Reduction for Multimodal Large Language Models [3.368594680297987]
We introduce a learning-free token reduction (LFTR) method designed for Multimodal Large Language Models (MLLMs). By capitalizing on the redundancy in visual representations, our approach effectively reduces tokens while preserving the general inference performance of MLLMs. Our results show that LFTR achieves up to a $16\times$ reduction of visual tokens while maintaining or even enhancing performance on mainstream vision question-answering benchmarks.
arXiv Detail & Related papers (2025-01-29T02:52:32Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)