UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
- URL: http://arxiv.org/abs/2511.19413v2
- Date: Wed, 26 Nov 2025 05:12:27 GMT
- Title: UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
- Authors: Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
- Abstract summary: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. We present UniGame, a self-adversarial post-training framework that directly targets the inconsistency between the two.
- Score: 21.728770994708402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets these inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). It also achieves substantial improvements in understanding (+3.6%), generation (+0.02), and out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA, respectively). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
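The self-adversarial loop described in the abstract, where a lightweight perturber at the shared token interface attacks the understanding branch, can be illustrated in a few lines. The following is a minimal PyTorch sketch under stated assumptions: the tanh-bounded linear perturber, the alternating update order, and all names and hyperparameters are ours for exposition, and the standalone perturber plays the adversary here whereas in the paper the generation branch drives that role. See the linked repository for the official implementation.

```python
# Minimal self-adversarial post-training sketch. Everything below is an
# illustrative assumption, not UniGame's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPerturber(nn.Module):
    """Lightweight adversary at the shared token interface (well under 1%
    of model parameters for realistic hidden sizes)."""
    def __init__(self, dim: int, eps: float = 0.05):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.eps = eps  # bound on the perturbation magnitude (assumed)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tanh keeps the shift bounded so the adversary stays "lightweight"
        return tokens + self.eps * torch.tanh(self.proj(tokens))

def self_adversarial_step(tokens, labels, head, perturber, opt_model, opt_adv):
    """One alternating min-max step: the perturber ascends the understanding
    loss, then the model descends on clean + perturbed tokens."""
    # --- adversary turn: maximize the understanding loss ---
    adv_loss = F.cross_entropy(head(perturber(tokens.detach())), labels)
    opt_adv.zero_grad()
    (-adv_loss).backward()   # gradient ascent via the negated loss
    opt_adv.step()

    # --- model turn: defend on clean and adversarially shifted tokens ---
    with torch.no_grad():
        adv_tokens = perturber(tokens)
    loss = F.cross_entropy(head(tokens), labels) + \
           F.cross_entropy(head(adv_tokens), labels)
    opt_model.zero_grad()    # also clears stale grads from the turn above
    loss.backward()
    opt_model.step()
    return loss.item()

# Toy usage with made-up shapes: 8 tokens of width 32, 4-way task.
if __name__ == "__main__":
    dim, n_tok, n_cls = 32, 8, 4
    head = nn.Sequential(nn.Flatten(1), nn.Linear(n_tok * dim, n_cls))
    perturber = TokenPerturber(dim)
    opt_model = torch.optim.AdamW(head.parameters(), lr=1e-3)
    opt_adv = torch.optim.AdamW(perturber.parameters(), lr=1e-3)
    tokens, labels = torch.randn(16, n_tok, dim), torch.randint(0, n_cls, (16,))
    for _ in range(3):
        print(self_adversarial_step(tokens, labels, head, perturber,
                                    opt_model, opt_adv))
```

The alternating structure is the key design point: the adversary's step maximizes the loss it can induce through bounded token shifts, and the model's step then minimizes loss on both clean and shifted tokens, which is what "turning the model into its own adversary" amounts to operationally.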
Related papers
- UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model [50.68870074090426]
We introduce UniWeTok, a unified discrete tokenizer for Unified Multimodal Large Language Models. For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. We propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios. (A hedged sketch of the implicit-codebook idea follows this entry.)
arXiv Detail & Related papers (2026-02-15T15:07:19Z)
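As a rough illustration of how a codebook of size 2^128 can exist without storing 2^128 embeddings, here is a hedged sketch of sign-based binary quantization with a straight-through gradient; the layer sizes and names are assumptions for exposition, not UniWeTok's actual design.

```python
# Binarize a 128-d latent by sign, giving an implicit codebook of 2^128
# codes with no per-code embedding table. Illustrative assumption only.
import torch
import torch.nn as nn

class BinaryTokenizerSketch(nn.Module):
    def __init__(self, in_dim: int = 512, code_dim: int = 128):
        super().__init__()
        self.encode = nn.Linear(in_dim, code_dim)
        self.decode = nn.Linear(code_dim, in_dim)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        b = torch.sign(z)              # codes in {-1, +1}^128 (0 is rare)
        b = z + (b - z).detach()       # straight-through estimator
        return self.decode(b), b       # reconstruction and binary code
```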
- UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision [34.575729271291436]
We formalize Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. We propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. (A toy sketch of this loop follows this entry.)
arXiv Detail & Related papers (2026-01-06T17:15:50Z)
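To make the Text-to-Image-to-Text loop concrete, here is a hedged toy sketch of a cycle-consistency score; ToyUMM and the Jaccard scorer are stand-in assumptions, not UniCycle's actual protocol, which would use a real unified model and a learned text-similarity metric.

```python
# Toy cycle-consistency check: prompt -> "image" -> caption -> overlap score.
class ToyUMM:
    """Stand-in for a unified multimodal model's two directions."""
    def generate_image(self, text: str) -> str:
        return f"<image|{text}>"      # pretend image conditioned on text

    def describe_image(self, image: str) -> str:
        return image.removeprefix("<image|").removesuffix(">")

def cycle_score(model, prompt: str) -> float:
    """Reconstruct the prompt through the T2I -> I2T loop, then score
    agreement with token-level Jaccard overlap (a crude proxy)."""
    reconstruction = model.describe_image(model.generate_image(prompt))
    a, b = set(prompt.lower().split()), set(reconstruction.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(cycle_score(ToyUMM(), "a red bicycle leaning against a brick wall"))
# -> 1.0 for this lossless toy; real UMMs lose information around the loop
```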
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video reward model (RM) training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z)
- Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models [27.38501052629525]
We propose Uni-X, a two-end-separated, middle-shared architecture for unified multimodal models. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. Our results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. (A minimal layout sketch follows this entry.)
arXiv Detail & Related papers (2025-09-29T07:05:10Z)
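As a rough picture of the two-end-separated, middle-shared layout, here is a minimal PyTorch sketch; layer counts, widths, and module names are illustrative assumptions rather than the paper's configuration.

```python
# Modality-specific stacks at both ends, shared trunk in the middle.
import torch
import torch.nn as nn

def block(dim: int) -> nn.Module:
    return nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

class UniXSketch(nn.Module):
    def __init__(self, dim: int = 512, n_end: int = 2, n_mid: int = 8):
        super().__init__()
        # modality-specific entry stacks (one per modality)
        self.text_in = nn.Sequential(*[block(dim) for _ in range(n_end)])
        self.image_in = nn.Sequential(*[block(dim) for _ in range(n_end)])
        # shared middle trunk: high-level semantic fusion happens here
        self.shared = nn.Sequential(*[block(dim) for _ in range(n_mid)])
        # modality-specific exit stacks
        self.text_out = nn.Sequential(*[block(dim) for _ in range(n_end)])
        self.image_out = nn.Sequential(*[block(dim) for _ in range(n_end)])

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        enter = self.text_in if modality == "text" else self.image_in
        leave = self.text_out if modality == "text" else self.image_out
        return leave(self.shared(enter(tokens)))

# usage: same-shape output, e.g.
# y = UniXSketch()(torch.randn(2, 16, 512), modality="image")
```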
- Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning [12.528410977116438]
Multiview learning on Boolean circuits holds immense promise, as different graph-based representations offer complementary structural and semantic information. We introduce MixGate, a framework built on a principled training curriculum that teaches the model a shared, function-aware representation space. We show that our alignment-first strategy transforms masked modeling from an ineffective technique into a powerful performance driver.
arXiv Detail & Related papers (2025-09-25T10:12:04Z)
- UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges, as incomplete or corrupted modalities degrade performance. We propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z)
- Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition [18.582459363950907]
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR). We introduce Uni-MuMER, which fully fine-tunes a vision-language model for the HMER task without modifying its architecture. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
arXiv Detail & Related papers (2025-05-29T15:41:00Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection," where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Combating Exacerbated Heterogeneity for Robust Models in Federated Learning [91.88122934924435]
The combination of adversarial training and federated learning can lead to undesired robustness deterioration.
We propose a novel framework called Slack Federated Adversarial Training (SFAT).
We verify the rationality and effectiveness of SFAT on various benchmark and real-world datasets.
arXiv Detail & Related papers (2023-03-01T06:16:15Z)
- Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475]
We introduce a novel backbone architecture that improves the target-perception ability of feature representations for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNNs and Transformers.
arXiv Detail & Related papers (2022-01-07T16:22:27Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of which utterances or tokens are dull, without any feature engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal. (A hedged sketch of the first model's running-average diversity measure follows this entry.)
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
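To ground the first model's mechanism, here is a hedged sketch of one plausible reading of MinAvgOut: maintain a running average of the decoder's output distributions and penalize probability mass on habitually frequent ("dull") tokens. The score, update rule, and momentum below are assumptions, not the paper's exact formulation.

```python
# Running-average "dullness" penalty; all details are illustrative.
import torch

class AvgOutTracker:
    def __init__(self, vocab_size: int, momentum: float = 0.99):
        self.avg = torch.full((vocab_size,), 1.0 / vocab_size)  # uniform init
        self.momentum = momentum

    def update(self, probs: torch.Tensor) -> None:
        """probs: (N, vocab) softmaxed decoder outputs from one batch."""
        batch_mean = probs.detach().mean(dim=0)
        self.avg = self.momentum * self.avg + (1 - self.momentum) * batch_mean

    def dullness_penalty(self, probs: torch.Tensor) -> torch.Tensor:
        # Dot product with the running average: large when the model keeps
        # re-emitting its habitual tokens, so training minimizes it.
        return (probs * self.avg.to(probs.device)).sum(dim=-1).mean()

# Assumed use inside a seq2seq training step (logits: (B, T, V)):
#   probs = logits.flatten(0, 1).softmax(dim=-1)
#   loss = nll_loss + lambda_div * tracker.dullness_penalty(probs)
#   tracker.update(probs)
```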