Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
- URL: http://arxiv.org/abs/2507.16746v1
- Date: Tue, 22 Jul 2025 16:35:36 GMT
- Title: Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
- Authors: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
- Abstract summary: We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples. We focus on four categories of tasks where sketching or visual reasoning is especially natural. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains.
- Score: 105.25503508433758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans often use visual aids, such as diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural: scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; and visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus improves our test-set accuracy by +12% and yields up to a +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support the development and evaluation of visual CoT.
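The dataset's central object is an interleaved text-image reasoning trace. As a rough illustration only, the Python sketch below models such a trace with alternating text and image steps and flattens it into a training string with image placeholder tokens; all field names, the placeholder token, and the example are hypothetical assumptions, not Zebra-CoT's actual schema or the Anole/Bagel training format.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical sketch of an interleaved text-image reasoning trace,
# in the spirit of Zebra-CoT; the schema below is an assumption for
# illustration, not the dataset's actual format.

@dataclass
class TextStep:
    text: str          # a natural-language reasoning step

@dataclass
class ImageStep:
    image_path: str    # a sketch/diagram produced at this point in the reasoning

Step = Union[TextStep, ImageStep]

@dataclass
class InterleavedTrace:
    question: str
    steps: List[Step]  # alternating text and image reasoning steps
    answer: str

def to_training_sequence(trace: InterleavedTrace, image_token: str = "<image>") -> str:
    """Flatten a trace into one string, replacing each image with a placeholder
    token that a multimodal model would later resolve to image embeddings."""
    parts = [trace.question]
    for step in trace.steps:
        if isinstance(step, TextStep):
            parts.append(step.text)
        else:
            parts.append(image_token)
    parts.append(f"Answer: {trace.answer}")
    return "\n".join(parts)

# Toy geometry example with one intermediate sketch.
trace = InterleavedTrace(
    question="What is the area of a right triangle with legs 3 and 4?",
    steps=[
        TextStep("Sketch the triangle and label the legs."),
        ImageStep("sketches/triangle_3_4.png"),
        TextStep("Area = 1/2 * base * height = 1/2 * 3 * 4 = 6."),
    ],
    answer="6",
)
print(to_training_sequence(trace))
```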
Related papers
- Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning [14.984593408786045]
We propose ReasonBench to evaluate the performance of visual language models (VLMs) in graphic reasoning tasks. ReasonBench includes 1,613 questions from real-world intelligence tests. We benchmark 11 mainstream VLMs and reveal significant limitations of current models.
arXiv Detail & Related papers (2025-08-01T05:12:38Z) - Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data [10.435321637846142]
We introduce Causal3D, a novel benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds.
arXiv Detail & Related papers (2025-03-06T03:40:01Z) - Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. Yet they often struggle with visual arithmetic: seemingly simple capabilities like object counting or length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles [29.214813685163218]
The release of OpenAI's o-[n] series, such as o1, o3, and o4-mini, marks a significant paradigm shift in Large Language Models. We track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles. Our results reveal that the o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperforms the GPT-[n] series and shows strong scalability in multimodal reasoning.
arXiv Detail & Related papers (2025-02-03T05:47:04Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.