Related papers: Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

URL: http://arxiv.org/abs/2506.03591v1
Date: Wed, 04 Jun 2025 05:44:21 GMT
Title: Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts
Authors: Jiaxing Zhang, Xinyi Zeng, Hao Tang,
Abstract summary: multimodal large language models (MLLMs) integrate both understanding and generation tasks within a single framework.<n> intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges.<n>We propose a novel approach that decouples internal components of AR to resolve task objective conflicts.
Score: 11.307588007047407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.

Related papers

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation [39.921363034430875]
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence.<n>We study the modality alignment behaviors of task-specific expert models for understanding and generation.<n>We introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference.
arXiv Detail & Related papers (2025-06-20T17:52:31Z)
Towards Unified Modeling in Federated Multi-Task Learning via Subspace Decoupling [23.642760378344335]
Federated Multi-Task Learning (FMTL) enables multiple clients performing heterogeneous tasks without exchanging their local data.<n>Most existing FMTL methods focus on building personalized models for each client and unable to support the aggregation of multiple heterogeneous tasks into a unified model.<n>We propose FedDEA, an update-structure-aware aggregation method specifically designed for multi-task model integration.
arXiv Detail & Related papers (2025-05-30T03:53:21Z)
ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation [73.18867725540865]
Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models.<n>We propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework.
arXiv Detail & Related papers (2025-05-24T11:01:45Z)
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting.<n>We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism.<n>Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [72.10987117380584]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data.<n>We find existing methods discard task-specific information that, while causing conflicts, is crucial for performance.<n>Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge. Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition. We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
arXiv Detail & Related papers (2024-03-01T07:06:57Z)
InterroGate: Learning to Share, Specialize, and Prune Representations for Multi-task Learning [17.66308231838553]
We propose a novel multi-task learning (MTL) architecture designed to mitigate task interference while optimizing inference computational efficiency. We employ a learnable gating mechanism to automatically balance the shared and task-specific representations while preserving the performance of all tasks.
arXiv Detail & Related papers (2024-02-26T18:59:52Z)
Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion [86.6191592951269]
Merging models fine-tuned from common extensively pretrained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multitask model that performs well across diverse tasks. We propose the CONtinuous relaxation dis (Concrete) subspace learning method to identify a common lowdimensional subspace and utilize its shared information track interference problem without sacrificing performance.
arXiv Detail & Related papers (2023-12-11T07:24:54Z)
Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning [29.14234496784581]
We propose Contrastive Modules with Temporal Attention(CMTA) method for multi-task reinforcement learning. CMTA constrains the modules to be different from each other by contrastive learning and combining shared modules at a finer granularity than the task level. Experimental results show that CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements.
arXiv Detail & Related papers (2023-11-02T08:41:00Z)
Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph. We propose a novel Multi-GraIned Multimodal InteraCtion Network $textbf(MIMIC)$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making [19.87916700767421]
Vision language decision making (VLDM) is a challenging multimodal task. From an environment perspective, we find that task episodes can be divided into fine-grained textitunits We propose a novel hybrid-training framework that enables active exploration in the environment and reduces the exposure bias.
arXiv Detail & Related papers (2023-07-16T11:54:16Z)
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.