Hyperbolic Learning with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2408.05097v1
- Date: Fri, 9 Aug 2024 14:39:15 GMT
- Title: Hyperbolic Learning with Multimodal Large Language Models
- Authors: Paolo Mandica, Luca Franco, Konstantinos Kallidromitis, Suzanne Petryk, Fabio Galasso
- Abstract summary: We address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture.
We propose a novel training strategy for a hyperbolic version of BLIP-2 that achieves performance comparable to its Euclidean counterpart, while maintaining stability throughout the training process and providing a meaningful indication of uncertainty with each embedding.
- Score: 8.98815579836401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2 that achieves performance comparable to its Euclidean counterpart, while maintaining stability throughout the training process and providing a meaningful indication of uncertainty with each embedding.
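As a hedged illustration of the uncertainty claim above, the sketch below lifts Euclidean encoder outputs onto the Lorentz model of hyperbolic space (following the general MERU recipe; the curvature value and dimensions are illustrative assumptions, not the paper's configuration) and reads off distance-from-origin as a per-embedding specificity score.

```python
# Illustrative sketch (not the authors' implementation): lift Euclidean
# encoder outputs onto the Lorentz model of hyperbolic space, as MERU does,
# and use distance-from-origin as a per-embedding specificity / uncertainty
# proxy. Curvature c = 1.0 is an assumed value.
import torch

def exp_map0(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Exponential map at the origin: tangent vector v -> space-like part
    of a point on the hyperboloid of curvature -c."""
    rc = c ** 0.5
    n = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.sinh(rc * n) * v / (rc * n)

def dist_from_origin(x_space: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance to the origin. Small distance = generic/uncertain
    embedding; large distance = specific/confident embedding."""
    rc = c ** 0.5
    # Time coordinate recovered from the hyperboloid constraint.
    x_time = torch.sqrt(1.0 / c + x_space.pow(2).sum(dim=-1))
    return torch.acosh((rc * x_time).clamp_min(1.0)) / rc

feats = torch.randn(4, 256)        # stand-in for (e.g.) BLIP-2 embeddings
scores = dist_from_origin(exp_map0(feats))
print(scores)                      # one uncertainty proxy per embedding
```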
Related papers
- Hyperbolic Deep Learning for Foundation Models: A Survey [16.14776172953206]
Foundation models pre-trained on massive datasets have demonstrated remarkable success in diverse downstream tasks.
Recent advances have leveraged hyperbolic neural networks to enhance foundation models.
This paper provides a comprehensive review of hyperbolic neural networks and their recent development for foundation models.
arXiv Detail & Related papers (2025-07-23T09:50:17Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.
MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts [23.011684464345294]
We introduce HELM, a family of HypErbolic Large Language Models.
For HELM-MICE, we develop hyperbolic Multi-Head Latent Attention.
For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization.
arXiv Detail & Related papers (2025-05-30T15:42:42Z)
- UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling.
Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatio-temporal learning.
We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
- Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU [50.9588132578029]
This paper investigates machine unlearning in hyperbolic contrastive learning.
We adapt Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies.
Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space.
arXiv Detail & Related papers (2025-03-19T12:47:37Z)
- Teaching Metric Distance to Autoregressive Multimodal Foundational Models [21.894600900013316]
We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models.
DIST2Loss transforms exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets.
Empirical evaluations show consistent performance gains in diverse multimodal applications.
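As a hedged illustration of that transform, the sketch below assumes each vocabulary token encodes a scalar value and uses an absolute-difference metric with temperature `tau`; these are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the DIST2Loss idea: convert the metric distance between
# each vocabulary token and the ground-truth value into soft categorical
# targets via a softmax (an exponential-family transform), then train with
# cross-entropy against those soft targets.
import torch
import torch.nn.functional as F

def dist2loss(logits: torch.Tensor, target_values: torch.Tensor,
              vocab_values: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: (B, V); target_values: (B,) ground-truth scalars;
    vocab_values: (V,) the value each discrete token represents."""
    d = (vocab_values[None, :] - target_values[:, None]).abs()  # (B, V)
    soft_targets = F.softmax(-d / tau, dim=-1)                  # closer = likelier
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Usage: tokens 0..9 encode a scalar; probability mass concentrates near the
# true values 3 and 7 instead of penalizing all wrong tokens equally.
loss = dist2loss(torch.randn(2, 10), torch.tensor([3.0, 7.0]), torch.arange(10.0))
```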
arXiv Detail & Related papers (2025-03-04T08:14:51Z)
- Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models [26.656858396343726]
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations.
Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data.
We explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data.
arXiv Detail & Related papers (2025-02-03T17:59:45Z)
- RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models [60.596005921295806]
Agglomerative models have emerged as a powerful approach to training vision foundation models.
We identify critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens.
We propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions.
arXiv Detail & Related papers (2024-12-10T17:06:41Z)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL achieves superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
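The abstract names the mechanism without details; the sketch below is one hedged guess at what quantization-aware scale learning can look like, with the per-row scale granularity and the int8 fake-quantization assumed purely for illustration.

```python
# Hedged sketch (assumed details, not the paper's exact scheme): freeze the
# quantized LLM weights and train only multiplicative scale corrections
# during a multimodal warmup phase, keeping adaptation memory-light.
import torch
import torch.nn as nn

class ScaleLearnedQuantLinear(nn.Module):
    def __init__(self, w_int8: torch.Tensor, base_scale: torch.Tensor):
        super().__init__()
        self.register_buffer("w_int8", w_int8)          # frozen int8 weights
        self.register_buffer("base_scale", base_scale)  # (out_features, 1)
        # The only trainable parameters: per-row scale corrections.
        self.scale_gain = nn.Parameter(torch.ones_like(base_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.w_int8.float() * self.base_scale * self.scale_gain
        return x @ w.t()

# Usage: quantize a pretrained weight matrix, then fine-tune only scale_gain.
w = torch.randn(16, 32)
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
layer = ScaleLearnedQuantLinear(torch.round(w / scale).to(torch.int8), scale)
out = layer(torch.randn(2, 32))   # (2, 16)
```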
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning [113.89327264634984]
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples.
Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially.
We propose a dual selective SSM projector that dynamically adjusts the projection parameters based on the intermediate features for dynamic adaptation.
arXiv Detail & Related papers (2024-07-08T17:09:39Z)
- Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
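The fusion step can be pictured as below; note the paper learns an MoE router, whereas this hedged sketch substitutes a fixed preference vector, so it illustrates only the weight-ensembling idea.

```python
# Hedged sketch of parameter fusion: mix the weights of models fine-tuned on
# individual objectives with a preference vector; sweeping the preference
# traces an approximate Pareto front. The fixed convex combination stands in
# for the paper's learned MoE routing.
import torch
import torch.nn as nn

def fuse_state_dicts(state_dicts, prefs):
    """Convex combination of aligned per-task model parameters."""
    prefs = torch.tensor(prefs, dtype=torch.float32)
    prefs = prefs / prefs.sum()
    return {k: sum(p * sd[k] for p, sd in zip(prefs, state_dicts))
            for k in state_dicts[0]}

# Usage: two task-specialized copies of the same architecture.
net_a, net_b, fused = nn.Linear(8, 4), nn.Linear(8, 4), nn.Linear(8, 4)
fused.load_state_dict(fuse_state_dicts(
    [net_a.state_dict(), net_b.state_dict()], [0.3, 0.7]))
```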
arXiv Detail & Related papers (2024-06-14T07:16:18Z)
- Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities.
Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation.
Experiments show that a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z)
- Tasks Makyth Models: Machine Learning Assisted Surrogates for Tipping Points [0.0]
We present a machine learning (ML)-assisted framework for detecting tipping points in the emergent behavior of complex systems.
We construct reduced-order models for the emergent dynamics at different scales.
We contrast the uses of the different models and the effort involved in learning them.
arXiv Detail & Related papers (2023-09-25T17:58:23Z)
- Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification [5.121989578393729]
Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning.
Coarse-grained labels are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL).
We propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL).
arXiv Detail & Related papers (2023-07-14T17:06:49Z)
- Hyperbolic Representation Learning: Revisiting and Advancing [43.1661098138936]
We introduce a position-tracking mechanism to scrutinize existing prevalent hyperbolic learning models, revealing that the learned representations are sub-optimal and unsatisfactory.
We propose a simple yet effective method, hyperbolic informed embedding (HIE), by incorporating cost-free hierarchical information deduced from the hyperbolic distance of the node to the origin.
Our method achieves a remarkable improvement of up to 21.4% compared to the competing baselines.
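The sketch below illustrates the cost-free signal in question; the depth-matching penalty is an assumed loss form for illustration, not the paper's objective.

```python
# Illustrative sketch: in the Poincare ball (curvature -1), the distance of an
# embedding to the origin, d(0, z) = 2 * artanh(||z||), encodes hierarchy
# level for free: roots sit near the origin, leaves near the boundary.
import torch

def dist_to_origin(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    n = z.norm(dim=-1).clamp(max=1.0 - eps)
    return 2.0 * torch.atanh(n)

def hierarchy_penalty(z: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    # Assumed penalty: push deeper (more specific) nodes farther from origin.
    return (dist_to_origin(z) - depth.float()).pow(2).mean()

z = torch.rand(5, 16) * 0.1              # embeddings inside the unit ball
print(hierarchy_penalty(z, torch.tensor([0, 1, 1, 2, 2])))
```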
arXiv Detail & Related papers (2023-06-15T13:25:39Z)
- HMSN: Hyperbolic Self-Supervised Learning by Clustering with Ideal Prototypes [7.665392786787577]
We use a hyperbolic representation space for self-supervised representation learning with prototype-based clustering approaches.
We extend Masked Siamese Networks to operate on the Poincaré ball model of hyperbolic space.
Unlike previous methods, we project to hyperbolic space at the output of the encoder network and utilise a hyperbolic projection head to ensure that the representations used for downstream tasks remain hyperbolic.
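That projection step is conventionally the exponential map at the origin of the Poincaré ball; a minimal sketch, with the curvature value assumed and the projection head omitted:

```python
# Minimal sketch of projecting Euclidean encoder features onto the Poincare
# ball via the exponential map at the origin (curvature c = 1.0 assumed).
import torch

def expmap0_poincare(v: torch.Tensor, c: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    sqrt_c = c ** 0.5
    n = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * n) * v / (sqrt_c * n)

feats = torch.randn(4, 128)              # stand-in for encoder output
z_hyp = expmap0_poincare(feats)
assert (z_hyp.norm(dim=-1) < 1).all()    # points lie inside the unit ball
```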
arXiv Detail & Related papers (2023-05-18T12:38:40Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Bi-level Doubly Variational Learning for Energy-based Latent Variable Models [46.75117861209482]
Energy-based latent variable models (EBLVMs) are more expressive than conventional energy-based models.
We propose Bi-level doubly variational learning (BiDVL) to facilitate learning EBLVMs.
Our model achieves impressive image-generation performance compared to related works.
arXiv Detail & Related papers (2022-03-24T04:13:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.