Teaching Metric Distance to Autoregressive Multimodal Foundational Models
- URL: http://arxiv.org/abs/2503.02379v1
- Date: Tue, 04 Mar 2025 08:14:51 GMT
- Title: Teaching Metric Distance to Autoregressive Multimodal Foundational Models
- Authors: Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
- Abstract summary: We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models. DIST2Loss transforms exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets. Empirical evaluations show consistent performance gains in diverse multimodal applications.
- Score: 21.894600900013316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss's effectiveness in resource-constrained settings.
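As described in the abstract, the core mechanism is to turn token-to-token distances into a soft categorical target rather than a one-hot label. Below is a minimal, hypothetical PyTorch sketch of that idea, assuming scalar-valued metric tokens (e.g., quantized coordinates), an L1 distance, and a temperature hyperparameter; the paper's exact exponential-family formulation may differ.

```python
# Minimal sketch of a distance-aware categorical target in the spirit of DIST2Loss.
# Assumption: each token in the "metric" vocabulary encodes a scalar value, and
# distance is L1; the paper's actual parameterization may differ.
import torch
import torch.nn.functional as F

def distance_soft_targets(target_ids, token_values, temperature=1.0):
    """Replace one-hot targets with a categorical distribution whose mass decays
    exponentially with the distance between each vocabulary entry's value and
    the ground-truth token's value."""
    # token_values: (V,) metric value assigned to every token in the vocabulary
    # target_ids:   (B,) ground-truth token index for each position
    dist = (token_values.unsqueeze(0) - token_values[target_ids].unsqueeze(1)).abs()  # (B, V)
    return F.softmax(-dist / temperature, dim=-1)

def distance_aware_loss(logits, target_ids, token_values, temperature=1.0):
    """Cross-entropy against the distance-aware soft targets instead of one-hot
    labels, so near-miss predictions are penalized less than distant ones."""
    soft_targets = distance_soft_targets(target_ids, token_values, temperature)
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Hypothetical usage: 256 quantized coordinate bins, a batch of 4 positions.
token_values = torch.arange(256, dtype=torch.float32)
logits = torch.randn(4, 256)
target_ids = torch.tensor([3, 100, 200, 50])
loss = distance_aware_loss(logits, target_ids, token_values)
```

The point of this shape of loss is that a standard categorical cross-entropy head can consume the soft targets unchanged, which matches the abstract's claim of compatibility with existing autoregressive architectures.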
Related papers
- BBQRec: Behavior-Bind Quantization for Multi-Modal Sequential Recommendation [15.818669767036592]
We propose a Behavior-Bind multi-modal Quantization for Sequential Recommendation (BBQRec) featuring dual-aligned quantization and semantics-aware sequence modeling.
BBQRec disentangles modality-agnostic behavioral patterns from noisy modality-specific features through contrastive codebook learning.
We design a discretized similarity reweighting mechanism that dynamically adjusts self-attention scores using quantized semantic relationships.
arXiv Detail & Related papers (2025-04-09T07:19:48Z) - Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z) - Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z) - REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z) - Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models [42.17166746027585]
We introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data.
Specifically, we propose a β-VAE-based module to extract factors as the initial nodes of the graph.
By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement.
arXiv Detail & Related papers (2024-07-26T15:32:21Z) - Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Modular and On-demand Bias Mitigation with Attribute-Removal Subnetworks [10.748627178113418]
We propose a novel modular bias mitigation approach, consisting of stand-alone, highly sparse debiasing subnetworks.
We conduct experiments on three classification tasks with gender, race, and age as protected attributes.
arXiv Detail & Related papers (2022-05-30T15:21:25Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (a rough sketch of this style of decoding constraint appears after this list).
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
With a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
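The Dynamic Blocking algorithm is named but not described in the paraphrasing summary above. The following is a rough, hypothetical Python sketch of that style of decoding constraint, assuming blocking is applied at the logit level to the continuations of source bigrams; the cited paper's actual policy (e.g., probabilistic block-list sampling) may differ, and all names here are illustrative.

```python
# Rough sketch of a Dynamic-Blocking-style decoding constraint: whenever the
# decoder has just emitted a token copied from the source, the token that
# immediately follows it in the source is masked out, discouraging verbatim
# reuse of source bigrams. Names and policy details are assumptions.
import torch

def build_block_map(source_ids):
    """Map each source token id to the set of ids that follow it in the source."""
    block_map = {}
    for cur, nxt in zip(source_ids, source_ids[1:]):
        block_map.setdefault(cur, set()).add(nxt)
    return block_map

def apply_dynamic_blocking(logits, generated_ids, block_map):
    """Mask the continuations of any source bigram the decoder has started to copy."""
    if generated_ids:
        for banned in block_map.get(generated_ids[-1], ()):
            logits[banned] = float("-inf")
    return logits

# Hypothetical usage at a single greedy decoding step.
source_ids = [11, 42, 42, 7, 99]
block_map = build_block_map(source_ids)
step_logits = torch.randn(32000)  # logits over a 32k-token vocabulary
step_logits = apply_dynamic_blocking(step_logits, generated_ids=[5, 42], block_map=block_map)
next_id = int(step_logits.argmax())
```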
This list is automatically generated from the titles and abstracts of the papers in this site.