UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy
Abstract Overview
This paper addresses unified multimodal in-context learning (ICL) for models that jointly handle understanding and generation, showing that ICL performance is highly sensitive to demonstration choice and can scale non-monotonically with additional shots. The authors introduce a six-level capability-oriented taxonomy—perception, imitation, conception, deduction, analogy, and discernment—that classifies the functional role of demonstrations based on cognitive demands. Guided by this taxonomy, they construct UniICL-760K, a corpus of 766,868 curated 8-shot episodes across 15 subtasks, and UniICL-Bench, a 1,250-episode benchmark for controlled evaluation of shot scaling and stability. They also propose CAPM (Context-Adaptive Prototype Modulator), a lightweight plug-and-play module that disentangles demonstration representations and dynamically adjusts context routing to stabilize few-shot adaptation.
Novelty
The main novelty is the capability-oriented systematization of unified multimodal ICL through a six-level cognitive taxonomy, combined with a large-scale taxonomy-guided training corpus (UniICL-760K) and a benchmark (UniICL-Bench) that explicitly measures shot-scaling behavior and robustness under context perturbations. The paper also introduces CAPM as a lightweight architectural intervention for stabilizing multimodal few-shot adaptation without modifying the full backbone.
Results
On UniICL-Bench, the proposed method achieves the highest peak understanding score (78.9) and the highest ICL efficiency (16.9) among the reported unified models, and it also leads on generation-side efficiency. Under context perturbations it degrades substantially less than the BAGEL baseline: 2.1% vs. 7.1% on understanding under random replacement, 1.4% vs. 2.8% under reverse ordering, and 1.6% vs. 7.9% under interference. In a human study over 350 generation episodes, UniICL achieves a 61.3% overall win rate against Nexus-Gen-V2.
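
The three perturbations named above can be illustrated with a small sketch. This summary does not give the paper's exact protocols, so everything here is an assumption made for illustration: the function names, the choice to model "interference" as inserting distractor demonstrations, and the choice to report degradation as a relative drop from the clean score.

```python
import random
from typing import List, Sequence


def random_replace(demos: Sequence, pool: Sequence, k: int, rng: random.Random) -> List:
    """Swap k randomly chosen demonstrations for items drawn from an unrelated pool."""
    out = list(demos)
    for i in rng.sample(range(len(out)), k):
        out[i] = rng.choice(pool)
    return out


def reverse_order(demos: Sequence) -> List:
    """Present the same demonstrations in reversed order."""
    return list(reversed(demos))


def interfere(demos: Sequence, distractors: Sequence, rng: random.Random) -> List:
    """Insert each distractor at a random position (assumed reading of 'interference')."""
    out = list(demos)
    for d in distractors:
        out.insert(rng.randrange(len(out) + 1), d)
    return out


def relative_drop(clean: float, perturbed: float) -> float:
    """Percentage drop of the perturbed score relative to the clean score (assumed metric)."""
    return 100.0 * (clean - perturbed) / clean
```

A stability run under these assumptions would score the model once on the clean episode and once on each perturbed variant, then compare the `relative_drop` values across models.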
Key Points
- The paper introduces a six-level taxonomy—perception, imitation, conception, deduction, analogy, and discernment—to categorize the functional role of demonstrations in multimodal ICL, revealing non-monotonic scaling behaviors where additional demonstrations can degrade perception-dominant tasks while improving complex inductive tasks.
- UniICL-760K (766,868 episodes across 15 subtasks) and UniICL-Bench (1,250 episodes with controlled perturbation protocols) provide the first cognitively structured training corpus and evaluation suite for unified multimodal in-context learning spanning both understanding and generation.
- The CAPM module adds negligible inference overhead (no measurable latency or VRAM increase across injection depths) while improving both few-shot performance and robustness. Ablations confirm that data-driven ICL training provides the dominant gain, with CAPM contributing additional improvement on top.
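
The non-monotonic shot scaling mentioned in the first key point can be checked mechanically once a score has been measured at each shot count. The sketch below is a hypothetical helper, not part of UniICL-Bench, and the example curve uses placeholder values rather than results from the paper.

```python
from typing import Dict


def peak_shot(scores: Dict[int, float]) -> int:
    """Return the shot count with the highest score (first one wins on ties)."""
    return max(sorted(scores), key=lambda k: scores[k])


def is_non_monotonic(scores: Dict[int, float]) -> bool:
    """True if the score ever decreases as the number of shots grows."""
    ordered = [scores[k] for k in sorted(scores)]
    return any(later < earlier for earlier, later in zip(ordered, ordered[1:]))


# Placeholder curve for illustration only: shots -> score.
example_curve = {0: 60.0, 2: 66.0, 4: 65.0, 8: 63.0}
print(peak_shot(example_curve), is_non_monotonic(example_curve))
```

On a curve shaped like the placeholder, the peak falls before the maximum shot count, which is the pattern the summary describes for perception-dominant tasks.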