GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
- URL: http://arxiv.org/abs/2506.07160v2
- Date: Mon, 30 Jun 2025 13:36:32 GMT
- Title: GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
- Authors: Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, Jiaqi Wang,
- Abstract summary: Group Contrastive Policy Optimization (GCPO) is a novel reinforcement learning framework featuring two key innovations.<n>We develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction.
- Score: 63.107398132743825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, amid which geometry problem solving remains a challenging area where auxiliary construction plays a enssential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and a (2) length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g. GRPO), achieving an average improvement of 4.29% across all benchmarks.
Related papers
- Group Causal Policy Optimization for Post-Training Large Language Models [10.791474908144703]
Group Relative Policy Optimization ( GRPO) treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction.<n>We propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components.<n>GCPO consistently surpasses existing methods, including GRPO across multiple reasoning benchmarks.
arXiv Detail & Related papers (2025-08-07T14:17:28Z) - Geometric-Mean Policy Optimization [122.95205388291987]
We propose a stabilized variant of Group Relative Policy Optimization ( GRPO)<n>Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards.<n>Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on multimodal reasoning benchmark.
arXiv Detail & Related papers (2025-07-28T09:54:05Z) - Hyperbolic Deep Learning for Foundation Models: A Survey [16.14776172953206]
Foundation models pre-trained on massive datasets have demonstrated remarkable success in diverse downstream tasks.<n>Recent advances have leveraged hyperbolic neural networks to enhance foundation models.<n>This paper provides a comprehensive review of hyperbolic neural networks and their recent development for foundation models.
arXiv Detail & Related papers (2025-07-23T09:50:17Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.<n>We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains.<n>Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO [68.44918104224818]
Autoregressive image generation presents unique challenges distinct from Chain-of-Thought (CoT) reasoning.<n>This study provides the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation.<n>Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms.
arXiv Detail & Related papers (2025-05-22T17:59:49Z) - Training Large Language Models to Reason via EM Policy Gradient [0.27195102129094995]
We introduce an off-policy reinforcement learning algorithm, EM Policy Gradient, to enhance LLM reasoning.<n>We evaluate the effectiveness of EM Policy Gradient on the GSM8K and MATH (HARD) datasets.<n>Models fine-tuned with our method exhibit cognitive behaviors, such as sub-problem decomposition, self-verification, and backtracking.
arXiv Detail & Related papers (2025-04-24T01:31:05Z) - SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM [18.275547804539016]
Two-Staged history-Resampling Policy optimization surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks.<n>We introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples.
arXiv Detail & Related papers (2025-04-19T13:06:03Z) - GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models [0.17265013728931003]
GRPO-LEAD is a suite of novel enhancements tailored for mathematical reasoning.<n>It introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems.
arXiv Detail & Related papers (2025-04-13T19:07:45Z) - Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries [42.83280708842304]
Euclidean space has been the de facto geometric setting for machine learning architectures.<n>At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling.<n>This paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models.
arXiv Detail & Related papers (2025-04-11T18:07:33Z) - Do Large Language Models Truly Understand Geometric Structures? [15.915781154075615]
We introduce the GeomRel dataset to evaluate large language models' understanding of geometric structures.<n>We propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships.
arXiv Detail & Related papers (2025-01-23T15:52:34Z) - Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning [53.13514542825493]
We introduce a two-stage Theorem-d Reverse Chain-of-Thought Reasoning Synthesis (TRCoT) framework.<n>The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties.<n>The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments.
arXiv Detail & Related papers (2024-10-23T13:58:39Z) - RMLR: Extending Multinomial Logistic Regression into General Geometries [64.16104856124029]
Our framework only requires minimal geometric properties, thus exhibiting broad applicability.
We develop five families of SPD MLRs under five types of power-deformed metrics.
On rotation matrices we propose Lie MLR based on the popular bi-invariant metric.
arXiv Detail & Related papers (2024-09-28T18:38:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.