Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
- URL: http://arxiv.org/abs/2506.06970v1
- Date: Sun, 08 Jun 2025 02:33:35 GMT
- Title: Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
- Authors: Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He,
- Abstract summary: We introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that guides cross modal representation learning.<n>MaPLE formulates the learning process as reinforcement learning with two key components: automatic preference data construction using off-the-shelf MLLM, and a new Relative Preference Alignment (RPA) loss.<n> Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval.
- Score: 11.460393501694021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, We introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine grained alignment priors inherent in MLLM to guide cross modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) Automatic preference data construction using off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
Related papers
- FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation [43.56253799373878]
We introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge.<n>This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights.<n>We demonstrate the effectiveness of our approach on six datasets in two domains, showing that our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.
arXiv Detail & Related papers (2025-07-09T07:49:55Z) - Alignment of large language models with constrained learning [93.2264691508005]
We study the problem of computing an optimal large language model (LLM) policy for a constrained alignment problem.<n>We employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the policy via Lagrangian and updating a dual variable via dual descent.
arXiv Detail & Related papers (2025-05-26T01:04:56Z) - Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions.<n>Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness.<n>We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z) - Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [86.21199607040147]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation language models.<n>We introduce Chain-of-Description, a step-by-step visual understanding method, and integrate structured chain-of-thought (CoT) reasoning to support in-depth multimodal reasoning.<n>Extensive experiments demonstrate that SIcog produces next-generation foundation MLLMs with substantially improved multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.<n>LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.<n>Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models [37.476241509187304]
Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data.<n>The lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications.<n>We introduce an efficient selective layer intervention based on prominent parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-10-23T09:40:15Z) - Imitating Language via Scalable Inverse Reinforcement Learning [34.161807103808016]
We focus on investigating the inverse reinforcement learning perspective to imitation.<n>We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance.
arXiv Detail & Related papers (2024-09-02T16:48:57Z) - CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.