HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2511.20520v1
- Date: Tue, 25 Nov 2025 17:23:38 GMT
- Title: HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
- Authors: Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang
- Abstract summary: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models). In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge.
- Score: 72.69742127579508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
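The selective bridging described in the abstract lends itself to a short sketch. The PyTorch code below is a minimal illustration under assumptions of our own: the class names, layer widths, the linear projection between expert widths, and the choice to bridge layers 4-7 of a 12-layer stack are all hypothetical, chosen only to show mid-layer shared attention between two otherwise decoupled expert stacks.

```python
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    """One pre-norm transformer layer of a single expert."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, context=None):
        # Bridged layers attend over their own tokens concatenated with the
        # other expert's tokens (shared attention); unbridged shallow and
        # deep layers fall back to plain modality-specific self-attention.
        kv = x if context is None else torch.cat([x, context], dim=1)
        q, kv = self.norm1(x), self.norm1(kv)
        x = x + self.attn(q, kv, kv, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class HBridgeSketch(nn.Module):
    """Two heterogeneous experts bridged only at their middle layers."""
    def __init__(self, und_dim=768, gen_dim=512, depth=12,
                 bridged_layers=range(4, 8)):
        super().__init__()
        self.und = nn.ModuleList(ExpertLayer(und_dim) for _ in range(depth))
        self.gen = nn.ModuleList(ExpertLayer(gen_dim) for _ in range(depth))
        # Hypothetical adapter: maps understanding-expert states into the
        # generative expert's width so its mid layers can attend to them.
        self.proj = nn.Linear(und_dim, gen_dim)
        self.bridged = set(bridged_layers)

    def forward(self, text_states, image_states):
        u, g = text_states, image_states
        for i, (lu, lg) in enumerate(zip(self.und, self.gen)):
            u = lu(u)  # understanding expert runs purely on its own tokens
            ctx = self.proj(u) if i in self.bridged else None
            g = lg(g, context=ctx)  # only mid layers see the other expert
        return g

out = HBridgeSketch()(torch.randn(2, 16, 768), torch.randn(2, 64, 512))
```

Bridging 4 of 12 layers rather than all of them is what yields the large reduction in attention sharing the abstract reports; for brevity the sketch bridges in one direction only and omits the semantic reconstruction tokens.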
Related papers
- Prism: Spectral Parameter Sharing for Multi-Agent Reinforcement Learning [2.504298819189614]
We propose Prism, a parameter sharing framework that induces inter-agent diversity by representing shared networks in the spectral domain via singular value decomposition (SVD). Experiments on both homogeneous (LBF, SMACv2) and heterogeneous benchmarks show that Prism achieves competitive performance with superior resource efficiency.
arXiv Detail & Related papers (2026-02-06T08:05:11Z)
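A minimal sketch of the spectral sharing idea described in the Prism entry above, assuming a low-rank SVD-style factorization: the shared basis factors U and V play the role of the shared network, and each agent's own vector of singular values induces diversity. The class name, rank, and initialization are hypothetical, not Prism's implementation.

```python
import torch
import torch.nn as nn

class SpectralSharedLinear(nn.Module):
    """Linear map whose spectral basis (U, V) is shared across agents,
    while each agent keeps its own singular-value vector."""
    def __init__(self, in_dim, out_dim, n_agents, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_dim, rank) * 0.02)  # shared
        self.V = nn.Parameter(torch.randn(rank, in_dim) * 0.02)   # shared
        self.s = nn.Parameter(torch.ones(n_agents, rank))         # per-agent

    def forward(self, x, agent_id):
        # W_agent = U @ diag(s_agent) @ V: one basis, many spectra.
        w = self.U @ torch.diag(self.s[agent_id]) @ self.V
        return x @ w.T

layer = SpectralSharedLinear(in_dim=32, out_dim=64, n_agents=4, rank=8)
y = layer(torch.randn(5, 32), agent_id=2)  # agent 2's view of the layer
```

Parameter count grows with the number of agents only through the rank-sized singular-value vectors, which is one plausible source of the resource efficiency claimed above.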
- Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
We analyze hybrid architectures through the lens of memory utilization and overall performance. Sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities.
arXiv Detail & Related papers (2025-10-30T18:19:52Z)
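To make the sequential-versus-parallel contrast above concrete, here is a hedged sketch of the two layouts. The blocks are simple stand-ins (LayerNorm + MLP) because only the composition pattern matters here; real hybrids would substitute actual attention and Mamba/SSM layers, and all names are hypothetical.

```python
import torch
import torch.nn as nn

def attn_block(dim):
    # Stand-in for an attention layer; ssm_block stands in for a Mamba/SSM
    # layer. Both are plain MLPs here so the sketch stays self-contained.
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

ssm_block = attn_block

class SequentialHybrid(nn.Module):
    """Interleaves the two block types along depth (A-S-A-S-...)."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            attn_block(dim) if i % 2 == 0 else ssm_block(dim)
            for i in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = x + blk(x)  # one block type at a time, stacked in series
        return x

class ParallelHybrid(nn.Module):
    """Runs both block types on the same input at every layer."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.pairs = nn.ModuleList(
            nn.ModuleList([attn_block(dim), ssm_block(dim)])
            for _ in range(depth))

    def forward(self, x):
        for attn, ssm in self.pairs:
            x = x + attn(x) + ssm(x)  # branch outputs fused per layer
        return x

x = torch.randn(2, 10, 64)
print(SequentialHybrid(64)(x).shape, ParallelHybrid(64)(x).shape)
```

The paper's finding, as summarized, is that the series layout favors shorter contexts while the per-layer fusion favors longer ones.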
- Multiview Manifold Evidential Fusion for PolSAR Image Classification [51.41332458376411]
We propose a new framework to integrate PolSAR manifold learning and evidence fusion into a unified architecture. Experiments on three real-world PolSAR datasets demonstrate that the proposed method consistently outperforms existing approaches in accuracy, robustness, and interpretability.
arXiv Detail & Related papers (2025-10-13T09:05:51Z)
- MCCE: A Framework for Multi-LLM Collaborative Co-Evolution [17.41200156551317]
Multi-objective discrete optimization problems pose significant challenges due to their vast and unstructured spaces. Large language models (LLMs) offer powerful priors and reasoning ability, making them natural candidates when expert knowledge matters. We introduce Multi-LLM Collaborative Co-evolution (MCCE), a hybrid framework that unites a frozen closed-source LLM with a lightweight trainable model.
arXiv Detail & Related papers (2025-10-06T10:03:28Z)
- Monte Carlo Tree Diffusion with Multiple Experts for Protein Design [50.056670856059014]
We propose MCTD-ME, which integrates masked diffusion models with tree search to enable multi-token planning and efficient exploration. Unlike autoregressive planners, MCTD-ME uses biophysics-enhanced diffusion denoising as the rollout engine. The framework is model-agnostic and applicable beyond inverse folding, including de novo protein engineering and multi-objective molecular generation.
arXiv Detail & Related papers (2025-09-19T09:24:42Z)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
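The summary does not detail FUDOKI's kinetic-optimal velocities, so the sketch below shows only the generic masked discrete-flow sampling loop such models build on: start from an all-mask sequence and commit a growing fraction of tokens at each step. The function name, the linear schedule, and the model interface are assumptions for illustration.

```python
import torch

def discrete_flow_sample(model, length, mask_id, steps=16):
    """Generic masked discrete-flow sampler (not FUDOKI's exact scheme).
    `model` maps a (1, length) token tensor to (1, length, vocab) logits."""
    x = torch.full((1, length), mask_id, dtype=torch.long)
    for s in range(steps):
        probs = model(x).softmax(-1)
        draws = torch.distributions.Categorical(probs).sample()
        # Unmask just enough positions that the masked fraction tracks
        # the linear schedule 1 - t as t goes from 0 to 1.
        t_now, t_next = s / steps, (s + 1) / steps
        p_unmask = (t_next - t_now) / max(1.0 - t_now, 1e-8)
        still_masked = x == mask_id
        flip = still_masked & (torch.rand_like(x, dtype=torch.float) < p_unmask)
        x = torch.where(flip, draws, x)
    return x

# Smoke test with a dummy model over a 100-token vocabulary (id 99 = mask).
dummy = lambda x: torch.randn(x.shape[0], x.shape[1], 100)
print(discrete_flow_sample(dummy, length=12, mask_id=99))
```

FUDOKI's contribution, per its title, lies in how the transition velocities are chosen, which this sketch deliberately leaves out.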
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
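Of the components listed for Mogao, multi-modal classifier-free guidance has a standard compositional form, sketched below. This is the common multi-condition CFG recipe rather than Mogao's specific rule; the function name and guidance weights are illustrative.

```python
import torch

def multimodal_cfg(eps_uncond, eps_text, eps_image,
                   w_text=5.0, w_image=1.5):
    """Combine denoiser predictions made under no conditioning, text-only
    conditioning, and image-only conditioning: start from the unconditional
    estimate and add a separately weighted delta per modality."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_image * (eps_image - eps_uncond))

e = [torch.randn(1, 4, 8, 8) for _ in range(3)]  # dummy denoiser outputs
guided = multimodal_cfg(*e)
```

Setting either weight to zero recovers single-condition CFG, which is why this form suits interleaved generation where one conditioning modality may be absent.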
- Astrea: A MOE-based Visual Understanding Model with Progressive Alignment [10.943104653307294]
Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding. We propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment.
arXiv Detail & Related papers (2025-03-12T14:44:52Z)
- MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs [45.20965298945085]
This paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics, and a novel method for merging experts with different architectures. Experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.
arXiv Detail & Related papers (2025-02-03T02:34:46Z)
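The summary does not spell out MergeME's interference-mitigation strategies, so as a stand-in the sketch below implements the well-known TIES-Merging recipe (trim small task-vector entries, elect a majority sign, average the agreeing entries), which illustrates what reducing parameter interference when merging homogeneous experts can look like. The function name and trim fraction are assumptions.

```python
import torch

def merge_experts(expert_weights, base, trim_frac=0.8):
    """TIES-style merge of same-shape expert tensors into one (a sketch,
    not MergeME's method)."""
    deltas = torch.stack([w - base for w in expert_weights])
    # 1) Trim: keep only the largest-magnitude entries of each task vector.
    k = max(1, int(trim_frac * deltas[0].numel()))
    for d in deltas:
        d[d.abs() <= d.abs().flatten().kthvalue(k).values] = 0
    # 2) Elect a per-parameter sign by total signed mass across experts.
    sign = torch.sign(deltas.sum(dim=0))
    # 3) Average only the deltas whose sign agrees with the elected one.
    agree = (torch.sign(deltas) == sign).float()
    merged = (deltas * agree).sum(0) / agree.sum(0).clamp(min=1)
    return base + merged

base = torch.zeros(4, 4)
experts = [base + 0.1 * torch.randn(4, 4) for _ in range(3)]
print(merge_experts(experts, base))
```

Merging experts with different architectures, which the paper also addresses, requires more than this, since the weight tensors no longer align shape-for-shape.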
- Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [103.05005690990271]
Mixture of insighTful Experts (MoTE) is a novel framework that combines reasoning chains and expert mixtures to improve self-alignment. MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI's state-of-the-art o1 model.
arXiv Detail & Related papers (2024-05-01T15:06:05Z)