Related papers: Learning Multimodal Latent Space with EBM Prior and MCMC Inference

Related papers

Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation [22.722731231389393]
Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) Large Language Models (LLMs)<n>We propose a Multimodal Chain-of-Student Reasoning Distillation model, MulCoT-RD, to address deployment constraints in resource-limited environments.<n>Experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
arXiv Detail & Related papers (2025-08-07T10:23:14Z)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey [124.23247710880008]
multimodal CoT (MCoT) reasoning has recently garnered significant research attention. Existing MCoT studies design various methodologies to address the challenges of image, video, speech, audio, 3D, and structured data. We present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions.
arXiv Detail & Related papers (2025-03-16T18:39:13Z)
The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models [31.81567038783558]
Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences. MM-RMs often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations. We introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples.
arXiv Detail & Related papers (2025-03-05T02:37:41Z)
Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM) Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. We have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
Learning Multimodal Latent Generative Models with Energy-Based Prior [3.6648642834198797]
We propose a novel framework that integrates the latent generative model with the EBM. This approach results in a more expressive and informative prior, better-capturing of information across multiple modalities.
arXiv Detail & Related papers (2024-09-30T01:38:26Z)
Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations. We introduce a predictive self-attention module to capture reliable context dynamics within modalities. A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities. A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
Learning Energy-Based Prior Model with Diffusion-Amortized MCMC [89.95629196907082]
Common practice of learning latent space EBMs with non-convergent short-run MCMC for prior and posterior sampling is hindering the model from further progress. We introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it.
arXiv Detail & Related papers (2023-10-05T00:23:34Z)
Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood [64.95663299945171]
Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming. There exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models. We propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs.
arXiv Detail & Related papers (2023-09-10T22:05:24Z)
Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps. We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism ( CRM) for multimodal and context integration. Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
Chain-of-Thought Prompt Distillation for Multimodal Named Entity Recognition and Multimodal Relation Extraction [8.169359626365619]
We generate a textitchain of thought (CoT) -- a sequence of intermediate reasoning steps. We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models. Our approach attains state-of-the-art accuracy and manifests a plethora of advantages concerning interpretability, data efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2023-06-25T04:33:56Z)
Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm. Despite its widespread use, theoretical justifications in this field are still notably lacking. This paper provides theoretical understandings to answer this question under a most popular multimodal fusion framework from the generalization perspective. A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning [29.237813880311943]
We propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal predicting. Experimental results with fused features on two image-text classification benchmarks show that our proposed Unimodality-Supervised MultiModal Contrastive UniS-MMC learning method outperforms current state-of-the-art multimodal methods.
arXiv Detail & Related papers (2023-05-16T09:18:38Z)
MCMC Should Mix: Learning Energy-Based Model with Neural Transport Latent Space MCMC [110.02001052791353]
Learning energy-based model (EBM) requires MCMC sampling of the learned model as an inner loop of the learning algorithm. We show that the model has a particularly simple form in the space of the latent variables of the backbone model.
arXiv Detail & Related papers (2020-06-12T01:25:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.