TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
- URL: http://arxiv.org/abs/2603.04772v1
- Date: Thu, 05 Mar 2026 03:43:52 GMT
- Title: TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
- Authors: Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, Li Li
- Abstract summary: TSEmbed is a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA). We introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity.
- Score: 26.532942920392376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
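A minimal sketch of the EANS idea (not the authors' code): the expert routing distribution of a query is compared against those of in-batch candidates, and negatives whose expert-activation patterns resemble the query's are up-weighted in an InfoNCE-style loss. The function names, the cosine comparison, and the `sharpness`/`tau` values are all assumptions for illustration; the abstract does not specify these details.

```python
import torch
import torch.nn.functional as F

def eans_weights(query_routing, cand_routing, sharpness=5.0):
    """Weight candidate negatives by how closely their expert routing
    distribution matches the query's -- an intrinsic proxy for semantic
    similarity (assumed here to be cosine similarity over the simplex)."""
    sim = F.cosine_similarity(query_routing.unsqueeze(0), cand_routing, dim=-1)
    w = F.softmax(sharpness * sim, dim=-1)   # emphasize shared-expert negatives
    return w * w.numel()                     # rescale so the mean weight is 1

def weighted_infonce(q, pos, negs, q_route, neg_routes, tau=0.07):
    """InfoNCE-style loss whose negative terms are reweighted by EANS."""
    q, pos, negs = (F.normalize(t, dim=-1) for t in (q, pos, negs))
    pos_logit = q @ pos / tau                # scalar similarity to the positive
    neg_logits = negs @ q / tau              # (N,) similarities to negatives
    w = eans_weights(q_route, neg_routes)    # (N,) routing-based emphasis
    denom = pos_logit.exp() + (w * neg_logits.exp()).sum()
    return denom.log() - pos_logit           # -log p(positive | candidates)
```

In a full system, `q_route` and `neg_routes` would be read off the MoE-LoRA router for each example; the two-stage paradigm described in the abstract suggests such weighting is only enabled once expert specialization has stabilized.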
Related papers
- Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings [44.77164359074224]
Multimodal Large Language Models (MLLMs) have become pivotal for advancing Universal Multimodal Embeddings (UME). Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations. We propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT.
arXiv Detail & Related papers (2026-02-14T15:35:03Z)
- Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts. We propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model [49.65930977591188]
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. We introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. Specifically, content-aware progressive training enhances the model's adaptability to diverse downstream tasks and builds enriched cross-modal proficiency. Collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings.
arXiv Detail & Related papers (2025-10-14T16:43:22Z)
- From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model [29.879983760203256]
Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, but adapting their generative nature for discriminative representation learning remains a significant challenge. We propose an efficient framework for universal multimodal embeddings that bridges this gap by centering on two synergistic components.
arXiv Detail & Related papers (2025-08-01T07:31:24Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- Multimodal Instruction Tuning with Conditional Mixture of LoRA [51.58020580970644]
This paper introduces a novel approach that integrates multimodal instruction tuning with Low-Rank Adaptation (LoRA). It innovates upon LoRA by dynamically constructing low-rank adaptation matrices tailored to the unique demands of each input instance. Experimental results on various multimodal evaluation datasets indicate that MixLoRA outperforms conventional LoRA at the same or even higher ranks. A minimal sketch of this instance-conditioned mixture appears after this list.
arXiv Detail & Related papers (2024-02-24T20:15:31Z)
- SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization [13.456240733175767]
SimMMDG is a framework for overcoming the challenges of domain generalization in multi-modal scenarios.
We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties, and impose distance constraints on modality-specific features to promote diversity; a minimal sketch of the contrastive objective appears after this list.
Our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon dataset.
arXiv Detail & Related papers (2023-10-30T17:58:09Z)
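To make the "dynamically constructing low-rank adaptation matrices" idea from the MixLoRA entry above concrete, here is a minimal sketch: a frozen linear layer is augmented with a pool of LoRA factor pairs that an input-conditioned gate mixes per instance. The class name, gate design, and initialization are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus an input-conditioned mixture of
    LoRA factors, so the low-rank update differs per input instance."""

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)             # freeze W
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))  # zero init
        self.gate = nn.Linear(d_in, n_experts)             # instance-level router

    def forward(self, x):                                  # x: (batch, d_in)
        mix = torch.softmax(self.gate(x), dim=-1)          # (batch, E)
        A = torch.einsum("be,erd->brd", mix, self.A)       # per-instance down-proj
        B = torch.einsum("be,eor->bor", mix, self.B)       # per-instance up-proj
        h = torch.einsum("brd,bd->br", A, x)               # (batch, rank)
        delta = torch.einsum("bor,br->bo", B, h)           # (batch, d_out)
        return self.base(x) + delta

# Example: wrap one projection of a frozen backbone.
layer = ConditionalLoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))                           # shape (2, 512)
```

Zero-initializing `B` keeps the adapted layer identical to the frozen base at the start of training, the usual LoRA convention.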
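Likewise, the supervised contrastive objective on modality-shared features described in the SimMMDG entry can be sketched as below, assuming a Khosla-style supervised contrastive loss in which same-label features from any modality count as positives; the function name and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(shared_feats, labels, tau=0.1):
    """Supervised contrastive loss on modality-shared features: features
    sharing a class label (regardless of modality) are pulled together,
    all other pairs pushed apart."""
    z = F.normalize(shared_feats, dim=-1)                 # (N, D)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.t() / tau).masked_fill(eye, -1e9)        # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)   # row-wise log-softmax
    pos = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    # Average log-probability over each anchor's same-label positives.
    return -((log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()
```

Applied per batch after projecting every modality's features into the shared space, this pulls same-class representations together across modalities.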