Related papers: Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs

URL: http://arxiv.org/abs/2507.16663v2
Date: Thu, 25 Sep 2025 11:17:06 GMT
Title: Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs
Authors: Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, Difan Zou
Abstract summary: We show that unified MLLMs exhibit an internal gap with understanding outperforming generation. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework. We empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training.
Score: 46.43090277452948
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamic theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding post-training data and leading to improved performance and unification.
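
The data-construction step described in the abstract (scoring the model's own generations with its understanding ability, then using the scores to build SFT and DPO data) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `build_post_training_data`, the `understanding_score` callable, and the `accept_threshold` and `min_margin` parameters are all hypothetical names and choices introduced here for illustration, and images are represented by plain string identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # identifier of the generation scored as better aligned
    rejected: str  # identifier of the generation scored as worse aligned


def build_post_training_data(
    candidates: Dict[str, Sequence[str]],
    understanding_score: Callable[[str, str], float],
    accept_threshold: float = 0.5,  # assumed cutoff, not from the paper
    min_margin: float = 0.2,        # assumed score gap, not from the paper
) -> Tuple[List[Tuple[str, str]], List[PreferencePair]]:
    """Turn self-scored generations into SFT examples and DPO pairs.

    `candidates` maps each prompt to the images the model generated for it.
    `understanding_score(prompt, image)` stands in for the same model's
    understanding branch judging prompt-image alignment (higher is better).
    """
    sft_examples: List[Tuple[str, str]] = []
    dpo_pairs: List[PreferencePair] = []

    for prompt, images in candidates.items():
        if not images:
            continue
        # Rank this prompt's generations by the model's own alignment score.
        scored = sorted(
            ((understanding_score(prompt, image), image) for image in images),
            key=lambda item: item[0],
            reverse=True,
        )
        best_score, best_image = scored[0]
        worst_score, worst_image = scored[-1]

        # SFT: keep the top generation only if it is judged prompt-aligned.
        if best_score >= accept_threshold:
            sft_examples.append((prompt, best_image))

        # DPO: keep a (chosen, rejected) pair only if the gap is clear.
        if best_score - worst_score >= min_margin:
            dpo_pairs.append(PreferencePair(prompt, best_image, worst_image))

    return sft_examples, dpo_pairs
```

In this sketch no external reward model or human label is involved; the only supervision is the score the model assigns to its own outputs, which is the property the abstract emphasizes. How the real framework scores, filters, and pairs samples may differ.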

Related papers

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs [6.696390269864987]
Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. We propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability.
arXiv Detail & Related papers (2026-02-25T10:28:45Z)
Learning to Self-Verify Makes Language Models Better Reasoners [65.75109817173315]
Large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. Despite powerful generation ability, LLMs remain weak at verifying their own answers. We show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification.
arXiv Detail & Related papers (2026-02-07T15:49:06Z)
Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models [23.128973540926552]
Endogenous Reprompting transforms the model's understanding into an explicit generative reasoning step. We show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality.
arXiv Detail & Related papers (2026-01-28T06:54:36Z)
LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics [10.638045151201084]
We present a principled taxonomy of twelve recent stateful unlearning methods. We revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob). Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective.
arXiv Detail & Related papers (2025-10-08T23:47:05Z)
Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation [50.22361866757033]
Unified vision-language models (VLMs) integrate both visual understanding and generation capabilities. This paper systematically investigates the generalization across understanding and generation tasks in unified VLMs.
arXiv Detail & Related papers (2025-05-29T03:40:21Z)
Can Large Reasoning Models Self-Train? [58.953117118687096]
Scaling the performance of large language models increasingly depends on methods that reduce reliance on human supervision. We propose an online self-training reinforcement learning algorithm that leverages the model's self-consistency to infer correctness signals and train without any ground-truth supervision.
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
Incentivizing Truthful Language Models via Peer Elicitation Games [10.530016288072506]
Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models.
arXiv Detail & Related papers (2025-05-19T18:16:58Z)
ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory.
arXiv Detail & Related papers (2025-02-21T15:50:41Z)
Unpacking the Resilience of SNLI Contradiction Examples to Attacks [0.38366697175402226]
We apply the Universal Adversarial Attack to examine the model's vulnerabilities. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes. Fine-tuning the model on an augmented dataset with adversarial examples restored its performance to near-baseline levels.
arXiv Detail & Related papers (2024-12-15T12:47:28Z)
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
Diffusing States and Matching Scores: A New Framework for Imitation Learning [16.941612670582522]
Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function. Diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression. We show our approach outperforms both GAN-style imitation learning baselines and discriminator-free imitation learning baselines across various continuous control problems.
arXiv Detail & Related papers (2024-10-17T17:59:25Z)
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept [36.27550578296276]
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. Intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. We show that intrinsic self-correction can be progressively improved, allowing it to approach a converged state.
arXiv Detail & Related papers (2024-06-04T14:55:43Z)
A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration [56.64703901898937]
We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields less repetitive texts.
arXiv Detail & Related papers (2022-05-05T08:50:50Z)
Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations. We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z)
Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration [83.29856873525674]
We introduce a lightweight latent variable model UOTA, targeting the view sampling issue for self-supervised learning. Our method directly generalizes to many mainstream self-supervised learning approaches.
arXiv Detail & Related papers (2021-12-15T14:05:23Z)
Solving Inefficiency of Self-supervised Representation Learning [87.30876679780532]
Existing contrastive learning methods suffer from very low learning efficiency. Under-clustering and over-clustering problems are major obstacles to learning efficiency. We propose a novel self-supervised learning framework using a median triplet loss.
arXiv Detail & Related papers (2021-04-18T07:47:10Z)
Bridging the Imitation Gap by Adaptive Insubordination [88.35564081175642]
We show that when the teaching agent makes decisions with access to privileged information, this information is marginalized during imitation learning. We propose 'Adaptive Insubordination' (ADVISOR) to address this gap. ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration.
arXiv Detail & Related papers (2020-07-23T17:59:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.