Related papers: Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification

Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification

URL: http://arxiv.org/abs/2602.05729v1
Date: Thu, 05 Feb 2026 14:52:35 GMT
Title: Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification
Authors: Lexiang Hu, Youze Xue, Dian Li, Gang Liu, Zhouchen Lin,
Abstract summary: Multimodal embeddings serve as a bridge for aligning vision and language.<n>We propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings.<n>AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding.
Score: 49.109117617514066
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.

Related papers

DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation [13.114773060703891]
We propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR)<n>For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs.<n>For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics.
arXiv Detail & Related papers (2026-02-14T10:42:56Z)
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection.<n>We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
ReMatch: Boosting Representation through Matching for Multimodal Retrieval [29.610030065465793]
ReMatch is a framework that leverages the generative strength of MLLMs for multimodal retrieval.<n>We train the embedding MLLM end-to-end with a chat-style generative matching stage.<n>Our experiments show particularly strong zero-shot generalization results on five datasets.
arXiv Detail & Related papers (2025-11-24T16:28:49Z)
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning [101.62386137855704]
We present a novel Universal Multimodal Embedding (UniME-V2) model.<n>Our approach first constructs a potential hard negative set through global retrieval.<n>We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs.<n>These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives.
arXiv Detail & Related papers (2025-10-15T13:07:00Z)
Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains.<n>Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored.<n>We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z)
Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions.<n>Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness.<n>We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z)
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance [47.53085562765585]
We introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model.<n>To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer.<n>To promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme.
arXiv Detail & Related papers (2024-12-09T17:11:50Z)
FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [56.08867996209236]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources.<n>We introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios.<n>We develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.