Related papers: From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

URL: http://arxiv.org/abs/2508.00955v1
Date: Fri, 01 Aug 2025 07:31:24 GMT
Title: From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
Authors: Yeong-Joon Ju, Seong-Whan Lee
Abstract summary: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks. But adapting their generative nature for discriminative representation learning remains a significant challenge. We propose an efficient framework for universal multimodal embeddings, which bridges the gap by centering on two synergistic components.
Score: 29.879983760203256
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.
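
Illustrative sketch (not from the paper): the abstract describes mining challenging negatives with the model's own embeddings while filtering out potential false negatives, but does not give the procedure. The snippet below shows one generic way such a step is often written, assuming cosine similarity over precomputed embeddings. The function name `mine_hard_negatives`, the `margin` threshold, and the top-`k` selection are all hypothetical choices made for this example and should not be read as the authors' method.

```python
import numpy as np

def l2_normalize(x):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mine_hard_negatives(query, positive, candidates, k=2, margin=0.05):
    """Return indices of up to k hardest negatives for a single query.

    Assumption for this sketch: a candidate whose similarity to the query is
    at or above (positive similarity - margin) is treated as a potential
    false negative and skipped. `k` and `margin` are illustrative values.
    """
    q = l2_normalize(query)
    pos_score = float(l2_normalize(positive) @ q)
    scores = l2_normalize(candidates) @ q
    # Keep only candidates that are clearly less similar than the positive.
    keep = np.flatnonzero(scores < pos_score - margin)
    # Among the kept candidates, the most similar ones are the hardest.
    order = keep[np.argsort(-scores[keep])]
    return order[:k].tolist()

# Toy 2-D embeddings; candidate 0 is identical to the query and is filtered.
query = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
candidates = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
print(mine_hard_negatives(query, positive, candidates))  # prints [1, 3]
```

In this toy run the candidate that matches the query exactly is excluded as a likely false negative, and the two remaining candidates closest to the query are returned as hard negatives.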

Related papers

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning [54.73049408950049]
We propose a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves unified multimodal retrieval from both structural and learning perspectives.
arXiv Detail & Related papers (2025-07-10T16:47:25Z)
Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying [7.9925771591348065]
Core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples. We propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings.
arXiv Detail & Related papers (2025-05-28T11:18:19Z)
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs [28.20725794099928]
We present UniME, a novel framework that learns discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning.
arXiv Detail & Related papers (2025-04-24T10:51:52Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study [8.827173113748701]
We study character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We find that, on average, self-denoising achieves substantially higher performance gains than alternative strategies.
arXiv Detail & Related papers (2025-04-03T16:17:56Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning [45.25602203155762]
Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. A major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression. We propose a novel model-agnostic Multistage Contrastive Learning framework.
arXiv Detail & Related papers (2024-02-19T04:13:33Z)
Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training. In this study, three self-supervision courses are designed to alleviate inherent flaws of "tug-of-war" dynamics. Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on GLUE and SQuAD 2.0 benchmarks.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks. Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients. We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
Task-Feature Collaborative Learning with Application to Personalized Attribute Prediction [166.87111665908333]
We propose a novel multi-task learning method called Task-Feature Collaborative Learning (TFCL). Specifically, we first propose a base model with a heterogeneous block-diagonal structure regularizer to leverage the collaborative grouping of features and tasks. As a practical extension, we extend the base model by allowing overlapping features and differentiating the hard tasks.
arXiv Detail & Related papers (2020-04-29T02:32:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
