CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
- URL: http://arxiv.org/abs/2602.08339v1
- Date: Mon, 09 Feb 2026 07:26:40 GMT
- Title: CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
- Authors: Chengyi Du, Yazhe Niu, Dazhong Shen, Luxin Xu,
- Abstract summary: CoTZero is an annotation-free paradigm with two components. CoTZero extracts visual primitives and composes them into structured question-reasoning forms. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards.
- Score: 9.44589968698834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations by introducing human models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
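For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how an F1 score is computed from classification counts. The function name `f1_score` and the example counts are illustrative assumptions and are not taken from the paper; the paper's actual confusion counts are not given in this listing.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    tp, fp, fn are the numbers of true positives, false positives,
    and false negatives. Returns 0.0 when all three are zero.
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0
    return 2 * tp / denom


# Hypothetical counts chosen only for illustration (NOT from the paper):
# 5 true positives, 1 false positive, 1 false negative.
print(round(100 * f1_score(tp=5, fp=1, fn=1), 2))  # prints 83.33
```

Many different combinations of counts yield the same F1 value, so the reported percentage alone does not determine the underlying precision and recall.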
Related papers
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLMs' reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
arXiv Detail & Related papers (2026-01-13T03:35:18Z) - HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm. It decomposes complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
arXiv Detail & Related papers (2025-11-25T06:24:25Z) - CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z) - Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z) - Zero-shot Compositional Action Recognition with Neural Logic Constraints [15.451848952659343]
ZS-CAR aims to identify unseen verb-object compositions in videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning's progress, two critical challenges persist: 1) a missing compositional structure constraint, leading to spurious correlations between primitives; 2) a neglected semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. We argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction.
arXiv Detail & Related papers (2025-08-04T11:40:42Z) - Boosting Neural Language Inference via Cascaded Interactive Reasoning [38.125341836302525]
Natural Language Inference (NLI) focuses on ascertaining the logical relationship between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. We introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI.
arXiv Detail & Related papers (2025-05-10T11:37:15Z) - Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension [31.952192907460713]
Relation-R1 is the first unified relation comprehension framework. It integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO). Experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
arXiv Detail & Related papers (2025-04-20T14:50:49Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction [4.7220779071424985]
Few-shot Relation Extraction (FSRE) aims to extract facts from a sparse set of labeled corpora.
Recent studies have shown promising results in FSRE by employing Pre-trained Language Models.
We introduce a novel synergistic anchored contrastive pre-training framework.
arXiv Detail & Related papers (2023-12-19T10:16:24Z) - Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.