UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2506.17202v1
- Date: Fri, 20 Jun 2025 17:52:31 GMT
- Title: UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Authors: Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
- Abstract summary: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. We study the modality alignment behaviors of task-specific expert models for understanding and generation. We introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference.
- Score: 39.921363034430875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension; in contrast, generation tasks follow a different trend, where modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
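To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of a Y-shaped backbone in the spirit of UniFork: shallow Transformer layers are shared across both tasks, after which the network forks into separate understanding and generation branches. All layer counts, dimensions, and module choices here are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a Y-shaped ("fork") backbone, inspired by UniFork.
# Hyperparameters (dim, n_heads, n_shared, n_branch) are hypothetical.
import torch
import torch.nn as nn


class YShapedBackbone(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_shared=8, n_branch=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        # Shallow layers are shared: both tasks learn a common
        # cross-modal representation here.
        self.shared = nn.ModuleList(make_layer() for _ in range(n_shared))
        # Deep layers fork into task-specific branches, so the divergent
        # alignment trends of the two tasks no longer interfere.
        self.und_branch = nn.ModuleList(make_layer() for _ in range(n_branch))
        self.gen_branch = nn.ModuleList(make_layer() for _ in range(n_branch))

    def forward(self, tokens, task: str):
        h = tokens
        for blk in self.shared:
            h = blk(h)
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for blk in branch:
            h = blk(h)
        return h


# Usage: route the same token sequence through the task-appropriate branch.
x = torch.randn(2, 16, 512)  # (batch, sequence, dim)
model = YShapedBackbone()
print(model(x, task="understanding").shape)  # torch.Size([2, 16, 512])
print(model(x, task="generation").shape)     # torch.Size([2, 16, 512])
```

The fork depth (n_shared here) is the central design knob in such a layout, trading shared cross-task representation learning against task specialization, which is the balance the abstract describes.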
Related papers
- Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts [11.307588007047407]
Multimodal large language models (MLLMs) integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges. We propose a novel approach that decouples internal components of autoregressive (AR) models to resolve task objective conflicts.
arXiv Detail & Related papers (2025-06-04T05:44:21Z)
- Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation [50.22361866757033]
Unified vision-language models (VLMs) integrate both visual understanding and generation capabilities. This paper systematically investigates generalization across understanding and generation tasks in unified VLMs.
arXiv Detail & Related papers (2025-05-29T03:40:21Z)
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning [68.98988753763666]
We propose VisualCloze, a universal image generation framework. VisualCloze supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. We introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.
arXiv Detail & Related papers (2025-04-10T17:59:42Z)
- A representational framework for learning and encoding structurally enriched trajectories in complex agent environments [1.904851064759821]
The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them, such as disentangled representations that exploit symmetries. We propose to enrich the agent's ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution.
arXiv Detail & Related papers (2025-03-17T14:04:27Z)
- Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [72.10987117380584]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. We find existing methods discard task-specific information that, while causing conflicts, is crucial for performance. Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
- Interpretable deformable image registration: A geometric deep learning perspective [9.13809412085203]
We present a theoretical foundation for designing an interpretable registration framework. We formulate an end-to-end process that refines transformations in a coarse-to-fine fashion. We conclude by showing significant improvement in performance metrics over state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-17T19:47:10Z)
- A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation [28.041319351752485]
In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data.
We introduce the morphology-task graph, which treats observations, actions, and goals/tasks in a unified graph representation.
We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations.
arXiv Detail & Related papers (2022-11-25T18:52:48Z)
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
- Adversarial Continual Learning [99.56738010842301]
We propose a hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features.
Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills.
arXiv Detail & Related papers (2020-03-21T02:08:17Z)