Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers
- URL: http://arxiv.org/abs/2601.20796v1
- Date: Wed, 28 Jan 2026 17:37:28 GMT
- Title: Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers
- Authors: Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata,
- Abstract summary: We investigate how transformers learn to associate information across modalities from in-context examples. We revisit core principles of unimodal ICL in modern transformers. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars.
- Score: 59.472505916020936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
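To make the described testbed concrete, here is a minimal sketch of a synthetic two-modality in-context classification episode, together with a label-copy baseline that mimics what an induction-style circuit does: retrieve the most similar matching in-context exemplar and copy its label. All names, sizes, and the exact episode construction below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical episode parameters (illustrative only, not the paper's setup).
N_CLASSES = 8           # labels per episode
DIM_A, DIM_B = 16, 16   # feature dimensions of the primary / secondary modality
CONTEXT_LEN = 16        # number of in-context (exemplar, label) pairs
NOISE = 0.1             # within-class noise controlling task difficulty

def make_episode():
    """Sample per-episode class prototypes in each modality, then an
    in-context sequence of (feature, label) pairs plus a final query."""
    protos_a = rng.normal(size=(N_CLASSES, DIM_A))  # primary-modality prototypes
    protos_b = rng.normal(size=(N_CLASSES, DIM_B))  # secondary-modality prototypes
    labels = rng.integers(0, N_CLASSES, size=CONTEXT_LEN + 1)
    modality = rng.integers(0, 2, size=CONTEXT_LEN + 1)  # 0 = modality A, 1 = modality B
    # Ensure the query's class also appears in context in the same modality,
    # so that copying a matching exemplar's label can succeed.
    labels[0], modality[0] = labels[-1], modality[-1]
    feats = []
    for y, m in zip(labels, modality):
        proto = protos_a[y] if m == 0 else protos_b[y]
        feats.append(proto + NOISE * rng.normal(size=proto.shape))
    return feats[:-1], labels[:-1], modality[:-1], feats[-1], labels[-1], modality[-1]

def induction_copy_predict(context_x, context_y, context_m, query_x, query_m):
    """Retrieve the most similar same-modality in-context exemplar and copy
    its label, mimicking an induction-style label-copy mechanism."""
    scores = [
        float(np.dot(x, query_x)) if m == query_m else -np.inf
        for x, m in zip(context_x, context_m)
    ]
    return context_y[int(np.argmax(scores))]

# Accuracy of the label-copy baseline over synthetic episodes.
trials, hits = 500, 0
for _ in range(trials):
    cx, cy, cm, qx, qy, qm = make_episode()
    hits += int(induction_copy_predict(cx, cy, cm, qx, qm) == qy)
print(f"label-copy accuracy: {hits / trials:.2f}")
```

Under this construction the copy baseline scores well above the 1/N_CLASSES chance level, which is the kind of behavior the induction-style label-copying circuit is described as implementing.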
Related papers
- Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence [28.260455480198047]
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. We experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training.
arXiv Detail & Related papers (2025-05-22T13:59:30Z)
- ModRWKV: Transformer Multimodality in Linear Time [6.022857751402095]
This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture.
arXiv Detail & Related papers (2025-05-20T15:34:36Z)
- Can Transformers Learn Full Bayesian Inference in Context? [13.479322264788367]
In this paper, we demonstrate that transformers can perform full Bayesian inference for commonly used statistical models in context. We introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows. Experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods.
arXiv Detail & Related papers (2025-01-28T10:04:53Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, the first time a simple transformer-based model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study toward a novel goal: merging vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.