Related papers: Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

URL: http://arxiv.org/abs/2509.17552v2
Date: Wed, 24 Sep 2025 05:24:09 GMT
Title: Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan,
Abstract summary: We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning.<n>Unlike traditional in-context learning, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning.
Score: 10.887230359930697
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

Related papers

Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based reasoning.<n>We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable.<n>We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models [33.822930522694406]
We overview a promising learning paradigm, i.e., Modular Machine Learning (MML), as an essential approach toward new-generation large language models (LLMs)<n>We propose a unified MML framework for LLMs, which decomposes the complex structure of LLMs into three interdependent components: modular representation, modular model, and modular reasoning.<n>Ultimately, we believe the integration of the MML with LLMs has the potential to bridge the gap between statistical (deep) learning and formal (logical) reasoning.
arXiv Detail & Related papers (2025-04-28T17:42:02Z)
When Text Embedding Meets Large Language Model: A Comprehensive Survey [17.263184207651072]
This survey focuses on the interplay between large language models (LLMs) and text embeddings.<n>It offers a novel and systematic overview of contributions from various research and application domains.<n>Building on this analysis, we outline prospective directions for the evolution of text embedding.
arXiv Detail & Related papers (2024-12-12T10:50:26Z)
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions.<n>We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs.<n>Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z)
Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability. The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. As the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality. We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs) Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus. Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
Link-Context Learning for Multimodal LLMs [40.923816691928536]
Link-context learning (LCL) emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL guides the model to discern not only the analogy but also the underlying causal associations between data points. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset.
arXiv Detail & Related papers (2023-08-15T17:33:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.