LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
- URL: http://arxiv.org/abs/2503.04812v1
- Date: Tue, 04 Mar 2025 10:21:57 GMT
- Title: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
- Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
- Abstract summary: We propose a framework that dynamically improves the embedding model's representation learning for negative pairs. LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance. LLaVE can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance.
- Score: 76.82159851648711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
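The abstract attributes the gain to re-weighting in-batch negatives by how hard they are to distinguish under the standard InfoNCE objective. As a rough illustration only (not the paper's published formulation), a hardness-weighted InfoNCE loss might look like the PyTorch sketch below; the exponential weighting form, `beta`, and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_infonce(query_emb, target_emb, temperature=0.05, beta=1.0):
    """InfoNCE with hardness-weighted in-batch negatives (illustrative sketch).

    Each negative's contribution to the denominator is scaled by a weight that
    grows with its similarity to the query, so hard negatives dominate the
    gradient. The weighting form, beta, and temperature are assumptions, not
    the paper's exact objective.
    """
    q = F.normalize(query_emb, dim=-1)                  # (B, D)
    t = F.normalize(target_emb, dim=-1)                 # (B, D)
    sim = q @ t.T / temperature                         # (B, B) cosine logits
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)

    with torch.no_grad():                               # weights are not back-propagated through
        neg_sim = sim.masked_fill(eye, float("-inf"))   # mask out the positive pair
        w = torch.softmax(beta * neg_sim, dim=1) * (B - 1)  # mean weight per negative stays 1

    # Positive keeps weight 1 (log-weight 0); negatives get their hardness log-weights.
    log_w = torch.where(eye, torch.zeros_like(sim), torch.log(w.clamp_min(1e-12)))
    loss = -(sim.diag() - torch.logsumexp(sim + log_w, dim=1))
    return loss.mean()


# Example with hypothetical shapes: 8 image-text pairs, 4096-dim LMM embeddings.
queries, targets = torch.randn(8, 4096), torch.randn(8, 4096)
print(hardness_weighted_infonce(queries, targets))
```

A convenient sanity check on this sketch: with beta = 0 the weights become uniform and the loss reduces to the standard InfoNCE objective.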
Related papers
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning [39.54891426369773]
Trade-offs between model size, architecture, and performance remain underexplored.
In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones.
To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures.
arXiv Detail & Related papers (2025-03-19T18:10:12Z)
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on SoTA VLMs such as Phi-3.5-V and LLaVA-1.6 and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
- Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing [37.400757839157116]
Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles.
We propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers.
arXiv Detail & Related papers (2024-06-06T12:08:43Z)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets.
For the model architecture, we propose a latent attention layer to obtain pooled embeddings.
For the training algorithm, we introduce a two-stage contrastive instruction-tuning method.
arXiv Detail & Related papers (2024-05-27T17:59:45Z)
- Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration [41.714059733079374]
Vision-Language Models (VLMs) integrate information from multiple modalities and have shown remarkable success across various tasks.
This study addresses two key questions: how to distribute sparsity across different modality-specific models, and how to restore the performance of pruned sparse VLMs.
arXiv Detail & Related papers (2024-04-03T03:27:01Z)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
We propose a simple yet effective training strategy, MoE-Tuning, for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers (a generic routing sketch follows this list).
Experiments demonstrate the strong performance of MoE-LLaVA on a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
- Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improvement ability of large language models has been shown to be absent in smaller models and difficult for them to learn.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve a LLaMA-7b model's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
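Two of the related papers above (MoE-LLaVA and the sparse-MoE scaling study) rely on routers that activate only the top-k experts per token. The sketch below is a minimal, generic top-k MoE layer in PyTorch; the expert count, k, and the MLP expert shape are illustrative assumptions and do not reproduce either paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic sparse MoE layer: a linear router scores experts per token,
    only the top-k experts are run, and their outputs are mixed with the
    renormalized routing weights."""

    def __init__(self, dim, hidden, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)         # (num_tokens, num_experts)
        topv, topi = gate.topk(self.k, dim=-1)           # keep only the top-k experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)     # renormalize the kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):        # dense loop for clarity, not efficiency
            for slot in range(self.k):
                mask = topi[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage with hypothetical sizes: 16 tokens of width 768.
layer = TopKMoELayer(dim=768, hidden=3072)
print(layer(torch.randn(16, 768)).shape)                 # torch.Size([16, 768])
```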
This list is automatically generated from the titles and abstracts of the papers in this site.