TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
- URL: http://arxiv.org/abs/2602.05570v1
- Date: Thu, 05 Feb 2026 11:49:30 GMT
- Title: TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
- Authors: Yikun Zong, Cheston Tan,
- Abstract summary: Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback.<n>However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning.<n>We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes.
- Score: 11.222572150508332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.
Related papers
- ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction [35.24704057622881]
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation.<n>We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction.
arXiv Detail & Related papers (2025-11-26T00:06:02Z) - Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning [52.99434388759101]
We propose a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning.<n>Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair.<n>Our experiments show that Agent0-VL achieves an 12.5% improvement over the base model.
arXiv Detail & Related papers (2025-11-25T04:15:14Z) - Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution [0.0]
We show that effective planning arises from the co-evolution of belief and intent within a minimal set of semantically rich tokens.<n>Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent.
arXiv Detail & Related papers (2025-10-30T12:16:45Z) - Understanding Transformers through the Lens of Pavlovian Conditioning [0.5076419064097734]
We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning.<n>We demonstrate that attention's queries, keys, and values can be mapped to the three elements of classical conditioning.<n>Our framework yields several theoretical insights grounded in this linearized model.
arXiv Detail & Related papers (2025-08-05T05:00:00Z) - Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation.<n>We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation.<n>Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z) - RLSR: Reinforcement Learning from Self Reward [0.0]
We show that large language models can effectively self-improve through self-judging without reference solutions.<n>Our experiments show that models can provide reliable reward signals without ground truth answers.<n>This work represents a significant step toward autonomous AI systems that continuously improve through self-directed learning.
arXiv Detail & Related papers (2025-05-12T23:51:04Z) - Absolute Zero: Reinforced Self-play Reasoning with Zero Data [57.30662797376754]
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models.<n>We introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability.<n>AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models.
arXiv Detail & Related papers (2025-05-06T09:08:00Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance [81.05882480184587]
We propose a computational framework of Kolb's learning cycle with Vygotsky's ZPD for autonomous agents.<n>Agent K is the 1st AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning.<n>With 9 gold, 8 silver, and 12 bronze medals level performance - including 4 gold and 4 silver on prize-awarding competitions - Agent K is the 1st AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning.
arXiv Detail & Related papers (2024-11-05T23:55:23Z) - Computing a human-like reaction time metric from stable recurrent vision
models [11.87006916768365]
We sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model.
We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks.
This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks.
arXiv Detail & Related papers (2023-06-20T14:56:02Z) - Neural Descent for Visual 3D Human Pose and Shape [67.01050349629053]
We present deep neural network methodology to reconstruct the 3d pose and shape of people, given an input RGB image.
We rely on a recently introduced, expressivefull body statistical 3d human model, GHUM, trained end-to-end.
Central to our methodology, is a learning to learn and optimize approach, referred to as HUmanNeural Descent (HUND), which avoids both second-order differentiation.
arXiv Detail & Related papers (2020-08-16T13:38:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.