Act2See: Emergent Active Visual Perception for Video Reasoning
Abstract Overview
Act2See is a supervised fine-tuning framework for video reasoning that enables a vision-language model to actively interleave visual evidence within its chain of thought. Instead of relying solely on static initial frames, the model can issue retrieval calls for additional frames from the source video or generation calls for hypothetical frames when reasoning requires missing or counterfactual visual evidence. The training data are constructed from Gemini 2.5 Pro reasoning traces and filtered against human-annotated chains of thought from MINERVA, CausalVQA, and Social Genome; the final SFT dataset contains 3,373 traces, with 47.67% including retrieved or generated frames. Fine-tuned from Qwen3-VL-8B-Thinking using LoRA, the resulting model is evaluated on five video reasoning benchmarks and improves consistently over its base model and several similarly sized open-source baselines.
Novelty
The distinctive contribution is enabling active visual perception through supervised fine-tuning on interleaved text-frame reasoning traces that include specialized retrieval and generation tool calls, rather than relying on static frame inputs or inference-time keyframe insertion alone. The framework combines two forms of visual evidence acquisition within the chain of thought: retrieving real frames at higher sampling rates, and generating hypothetical frames via conditional image generation. The generation path is particularly relevant for counterfactual video reasoning scenarios.
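To make the interleaving idea concrete, here is a minimal sketch of how a chain of thought containing tool calls could be expanded into a mixed text-frame context. It is not the paper's implementation: the `ToolCall` structure, the tool names `retrieve` and `generate`, the argument keys, and the string placeholders standing in for frames are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    kind: str   # "retrieve" or "generate" (hypothetical tool names)
    args: dict  # hypothetical argument keys, see the stubs below


Step = Union[str, ToolCall]


def dense_timestamps(start_s: float, end_s: float, fps: float) -> List[float]:
    """Timestamps for re-sampling a video segment at `fps` frames per second."""
    if fps <= 0 or end_s < start_s:
        raise ValueError("need fps > 0 and end_s >= start_s")
    n = int((end_s - start_s) * fps) + 1
    return [round(start_s + i / fps, 3) for i in range(n)]


def retrieve(args: dict) -> List[str]:
    # Stand-in for frame retrieval: returns frame identifiers, not decoded images.
    stamps = dense_timestamps(args["start_s"], args["end_s"], args["fps"])
    return [f"frame@{t}s" for t in stamps]


def generate(args: dict) -> List[str]:
    # Stand-in for a conditional image generator: returns a placeholder identifier.
    return [f"generated({args['prompt']})"]


TOOLS: Dict[str, Callable[[dict], List[str]]] = {
    "retrieve": retrieve,
    "generate": generate,
}


def interleave(steps: List[Step]) -> List[str]:
    """Pass text steps through and replace each tool call with the frames it yields."""
    context: List[str] = []
    for step in steps:
        if isinstance(step, str):
            context.append(step)
        else:
            context.extend(TOOLS[step.kind](step.args))
    return context


if __name__ == "__main__":
    trace = [
        "The cup is near the table edge; check the moment of contact.",
        ToolCall("retrieve", {"start_s": 10.0, "end_s": 12.0, "fps": 2.0}),
        "What would the scene look like if the cup had fallen?",
        ToolCall("generate", {"prompt": "cup on the floor"}),
    ]
    for item in interleave(trace):
        print(item)
```

In this sketch, retrieval re-samples a short time window more densely than the initial frames would cover, while generation supplies a frame that does not exist in the source video. In a real system both stubs would return images that are fed back to the model before it continues reasoning.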
Results
Act2See improves over Qwen3-VL-8B-Thinking on all five reported benchmarks: from 71.8 to 74.2 on Video-MME, 41.5 to 46.8 on VideoEspresso, 48.9 to 51.3 on EgoNormia, 38.2 to 47.1 on VCR-Bench, and 60.2 to 63.3 on ViTIB. The paper also reports gains over inference-only interleaving (ViTCoT) and RL-based interleaving baselines (ReWatch-R1, FrameMind). Ablation studies confirm that both retrieval and generation contribute to performance and that human-annotated CoT sources significantly outperform VLM-generated ones.
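The per-benchmark gains can be read off with a few lines of arithmetic. The snippet below uses only the scores quoted above; the variable names are illustrative.

```python
# (base model score, Act2See score) as quoted in the Results paragraph.
scores = {
    "Video-MME": (71.8, 74.2),
    "VideoEspresso": (41.5, 46.8),
    "EgoNormia": (48.9, 51.3),
    "VCR-Bench": (38.2, 47.1),
    "ViTIB": (60.2, 63.3),
}

# Absolute gain in points, rounded to one decimal to avoid float noise.
gains = {name: round(after - before, 1) for name, (before, after) in scores.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)
largest = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain}")
    print(f"mean gain: +{mean_gain}, largest: {largest}")
```

The absolute gains are 2.4, 5.3, 2.4, 8.9, and 3.1 points respectively, for an unweighted mean of 4.42 points. The largest improvement is on VCR-Bench and the smallest on Video-MME and EgoNormia.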
Key Points
- Act2See trains a VLM to insert retrieval or generation tool calls inside its chain-of-thought reasoning, so that it can actively acquire visual evidence during video QA. Retrieval searches the video at higher sampling rates, while generation produces hypothetical frames via conditional image generation.
- The SFT dataset is constructed from Gemini 2.5 Pro traces filtered using BGE M3-Embedding similarity (≥80%) against human-annotated reasoning data from MINERVA, CausalVQA, and Social Genome, yielding 3,373 high-quality examples with 47.67% containing retrieved or generated frames.
- Empirically, the method outperforms its base model across five benchmarks and shows advantages over inference-only (ViTCoT) and RL-based (ReWatch-R1, FrameMind) video-text interleaving approaches, while ablations confirm that combining retrieval and generation outperforms using either alone.
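The similarity filter in the second bullet can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: it assumes the similarity is cosine similarity between the embedding of a generated trace and the embedding of its human-annotated reference, and it takes precomputed vectors as input rather than calling the BGE M3-Embedding model. The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_traces(
    items: List[Tuple[str, Sequence[float], Sequence[float]]],
    threshold: float = 0.80,
) -> List[str]:
    """Keep the ids of traces whose embedding is at least `threshold` similar
    to the embedding of the human-annotated reference (assumed: cosine)."""
    return [tid for tid, trace_vec, ref_vec in items if cosine(trace_vec, ref_vec) >= threshold]


# Approximate number of traces with retrieved or generated frames,
# derived from the reported 47.67% of 3,373 (not stated directly in the text).
interleaved_count = round(3373 * 0.4767)

if __name__ == "__main__":
    toy = [
        ("t1", [3.0, 4.0], [4.0, 3.0]),  # cosine 0.96, kept
        ("t2", [1.0, 0.0], [1.0, 1.0]),  # cosine about 0.71, dropped
        ("t3", [1.0, 0.0], [0.0, 1.0]),  # cosine 0.0, dropped
    ]
    print(filter_traces(toy))
    print(interleaved_count)
```

With the 0.80 threshold, only traces that stay close to the human-annotated reasoning survive. Applying the reported 47.67% to the 3,373 retained traces gives roughly 1,608 traces that contain retrieved or generated frames.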