UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation
- URL: http://arxiv.org/abs/2601.11522v1
- Date: Fri, 16 Jan 2026 18:59:58 GMT
- Title: UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation
- Authors: Ruiheng Zhang, Jingfeng Yao, Huangxuan Zhao, Hao Yan, Xiao He, Lei Chen, Zhou Wei, Yong Luo, Zengmao Wang, Lefei Zhang, Dacheng Tao, Bo Du,
- Abstract summary: We present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation.<n>UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation.<n>On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance and a 24.2% gain in generation quality.
- Score: 98.93314262366681
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
Related papers
- Vision-Language Controlled Deep Unfolding for Joint Medical Image Restoration and Segmentation [34.04441838578788]
We propose a principled framework for joint All-in-One Medical Image Restoration and (AiOMIRS)<n>We introduce a frequency-aware Mamba mechanism to capture long-range dependencies for global segmentation while preserving the high-frequency textures necessary for restoration.<n>As a pioneering work in the AiOMIRS task, VL-DUN establishes a new state-of-the-art across multi-modal benchmarks, improving PSNR by 0.92 dB and the Dice coefficient by 9.76%.
arXiv Detail & Related papers (2026-01-30T15:48:35Z) - GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning [54.42973725693]
We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model.<n>GenAgent significantly boosts base generator(FLUX.1-dev) performance on GenEval++ and WISE.<n>Our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks.
arXiv Detail & Related papers (2026-01-26T14:49:04Z) - UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation [39.921363034430875]
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence.<n>We study the modality alignment behaviors of task-specific expert models for understanding and generation.<n>We introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference.
arXiv Detail & Related papers (2025-06-20T17:52:31Z) - SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards [55.99492656542475]
We propose textbfSUDER (textbfSelf-improving textbfUnified LMMs with textbfDual stextbfElf-textbfRewards), a framework reinforcing the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning [53.27049077100897]
generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding.<n>This work introduces self-conditioning, a mechanism that internally leverages the rich semantics inherent in denoising network to guide its own decoding layers.<n>Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures.
arXiv Detail & Related papers (2025-05-16T08:47:16Z) - Unifying Autoregressive and Diffusion-Based Sequence Generation [3.1853022872760186]
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models.<n>We introduce hyperschedules, which assign distinct noise schedules to individual token positions.<n>Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes.
arXiv Detail & Related papers (2025-04-08T20:32:10Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present emphHarmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.<n>Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.