FuguReport

UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Authors Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi, Xiaobin Hu, Yong Liu, Dacheng Tao
Affiliations Shanghai Jiao Tong University / Nanyang Technological University / Zhejiang University / National University of Singapore
Categories Method / In-context Learning / Unified multimodal in-context learning method; Data / Dataset Construction / Large-scale multimodal corpus building; Method / Model Adaptation / Context-adaptive prototype modulation
License CC BY 4.0

Abstract Overview

This paper addresses unified multimodal in-context learning (ICL) for models that jointly handle understanding and generation, showing that ICL performance is highly sensitive to demonstration choice and can scale non-monotonically with additional shots. The authors introduce a six-level capability-oriented taxonomy—perception, imitation, conception, deduction, analogy, and discernment—that classifies the functional role of demonstrations based on cognitive demands. Guided by this taxonomy, they construct UniICL-760K, a corpus of 766,868 curated 8-shot episodes across 15 subtasks, and UniICL-Bench, a 1,250-episode benchmark for controlled evaluation of shot scaling and stability. They also propose CAPM (Context-Adaptive Prototype Modulator), a lightweight plug-and-play module that disentangles demonstration representations and dynamically adjusts context routing to stabilize few-shot adaptation.
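To make the episode structure concrete, here is a minimal Python sketch of how a taxonomy-tagged 8-shot episode could be represented and truncated to k shots for a shot-scaling sweep. Only the six level names and the 8-shot size come from the summary above. The `Demo` and `Episode` classes, their field names, and the `with_k_shots` helper are hypothetical illustrations, not the authors' data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Capability(Enum):
    """The six taxonomy levels named in the summary."""
    PERCEPTION = "perception"
    IMITATION = "imitation"
    CONCEPTION = "conception"
    DEDUCTION = "deduction"
    ANALOGY = "analogy"
    DISCERNMENT = "discernment"


@dataclass(frozen=True)
class Demo:
    # Hypothetical fields: one demonstration input and its target output.
    source: str
    target: str


@dataclass(frozen=True)
class Episode:
    # Hypothetical layout: a capability tag, the demonstrations, and a query.
    capability: Capability
    demos: List[Demo]
    query: str


def with_k_shots(episode: Episode, k: int) -> Episode:
    """Return a copy of the episode that keeps only the first k demonstrations."""
    if not 0 <= k <= len(episode.demos):
        raise ValueError(f"k must be between 0 and {len(episode.demos)}")
    return Episode(episode.capability, episode.demos[:k], episode.query)


# Usage: sweep the shot count over one 8-shot episode.
episode = Episode(
    capability=Capability.ANALOGY,
    demos=[Demo(f"input_{i}", f"output_{i}") for i in range(8)],
    query="input_query",
)
shot_counts = [len(with_k_shots(episode, k).demos) for k in (0, 1, 2, 4, 8)]
```

Keeping the first k demonstrations is only one possible truncation rule. Given the reported sensitivity to demonstration choice, which demonstrations are kept may matter as much as how many.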

Novelty

The main novelty is the capability-oriented systematization of unified multimodal ICL through a six-level cognitive taxonomy, combined with a large-scale taxonomy-guided training corpus (UniICL-760K) and a benchmark (UniICL-Bench) that explicitly measures shot-scaling behavior and robustness under context perturbations. The paper also introduces CAPM as a lightweight architectural intervention for stabilizing multimodal few-shot adaptation without modifying the full backbone.

Results

On UniICL-Bench, the proposed method achieves the highest peak understanding score (78.9) and ICL efficiency (16.9) among reported unified models, with leading generation-side efficiency as well. Stability experiments show substantially smaller degradation than the BAGEL baseline under random replacement (2.1% vs. 7.1% for understanding), reverse ordering (1.4% vs. 2.8%), and interference perturbations (1.6% vs. 7.9%). A human study over 350 generation episodes against Nexus-Gen-V2 yields a 61.3% overall win rate for UniICL.
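The three perturbations named above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's protocol: the summary does not say how many demonstrations are replaced, where distractors are placed, or whether the reported percentages are relative or absolute drops. All function names here are hypothetical, and `relative_drop` assumes a drop measured relative to the clean score.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_replacement(demos: Sequence[T], pool: Sequence[T], n: int,
                       rng: random.Random) -> List[T]:
    """Replace n randomly chosen demonstrations with items drawn from pool."""
    out = list(demos)
    for i in rng.sample(range(len(out)), n):
        out[i] = rng.choice(pool)
    return out


def reverse_ordering(demos: Sequence[T]) -> List[T]:
    """Present the same demonstrations in reverse order."""
    return list(reversed(demos))


def interference(demos: Sequence[T], distractors: Sequence[T],
                 rng: random.Random) -> List[T]:
    """Insert distractors at random positions, keeping the original order."""
    out = list(demos)
    for d in distractors:
        out.insert(rng.randint(0, len(out)), d)
    return out


def relative_drop(clean: float, perturbed: float) -> float:
    """Percentage drop of the perturbed score relative to the clean score
    (assumed definition; the summary does not specify one)."""
    return 100.0 * (clean - perturbed) / clean
```

A stability check would then score the model on the clean context and on each perturbed context, and compare the resulting drops across models.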

Key Points

  1. The paper introduces a six-level taxonomy—perception, imitation, conception, deduction, analogy, and discernment—to categorize the functional role of demonstrations in multimodal ICL, revealing non-monotonic scaling behaviors where additional demonstrations can degrade perception-dominant tasks while improving complex inductive tasks.
  2. UniICL-760K (766,868 episodes across 15 subtasks) and UniICL-Bench (1,250 episodes with controlled perturbation protocols) provide the first cognitively structured training corpus and evaluation suite for unified multimodal in-context learning spanning both understanding and generation.
  3. The CAPM module adds negligible inference overhead (no measurable latency or VRAM increase across injection depths) while improving both few-shot performance and robustness, with ablations confirming that data-driven ICL training provides the dominant gain and CAPM contributes additional improvements on top.
