Clarifying Before Reasoning: A Coq Prover with Structural Context
- URL: http://arxiv.org/abs/2507.02541v1
- Date: Thu, 03 Jul 2025 11:35:34 GMT
- Title: Clarifying Before Reasoning: A Coq Prover with Structural Context
- Authors: Yanzhen Lu, Hanbin Yang, Xiaodie Wang, Ge Zhang, Biao Li, Chenxu Fu, Chao Li, Yang Yuan, Andrew Chi-Chih Yao,
- Abstract summary: We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context leads to a 1.85$\times$ improvement in clarity score. We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages.
- Score: 13.273599284897411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate whether improving task clarity can enhance the reasoning ability of large language models, focusing on theorem proving in Coq. We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context to the standard input used by modern LLMs leads to a 1.85$\times$ improvement in clarity score (44.5\%~$\rightarrow$~82.3\%). Using the general-purpose model \texttt{DeepSeek-V3}, our approach leads to a 2.1$\times$ improvement in proof success (21.8\%~$\rightarrow$~45.8\%) and outperforms the previous state-of-the-art \texttt{Graph2Tac} (33.2\%). We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages, following the same evaluation protocol as \texttt{Graph2Tac}. Furthermore, fine-tuning smaller models on our structured data achieves even higher performance (48.6\%). Our method uses selective concept unfolding to enrich task descriptions and employs a Planner--Executor architecture. These findings highlight the value of structured task representations in bridging the gap between understanding and reasoning.
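The abstract's pipeline, enriching a theorem statement by selectively unfolding the concepts it mentions, then feeding the result through a Planner and an Executor, can be sketched roughly as below. The function names, the toy concept table, and the stubbed-out LLM calls are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a Planner-Executor loop with selective concept
# unfolding (names, data, and LLM stubs are hypothetical, not the paper's code).

CONCEPT_DEFS = {  # toy table: concept name -> Coq-style definition
    "injective": "fun f => forall x y, f x = f y -> x = y",
    "surjective": "fun f => forall y, exists x, f x = y",
}

def unfold_concepts(theorem: str) -> str:
    """Selectively append definitions of concepts that occur in the goal."""
    notes = [f"{c} := {d}" for c, d in CONCEPT_DEFS.items() if c in theorem]
    return theorem + ("\nwhere " + "; ".join(notes) if notes else "")

def planner(task: str) -> list[str]:
    """Stand-in for an LLM call that breaks the enriched task into steps."""
    return [f"step: prove '{task.splitlines()[0]}'"]

def executor(step: str) -> str:
    """Stand-in for an LLM call that emits a Coq tactic for one step."""
    return "intros; auto."

def prove(theorem: str) -> list[str]:
    enriched = unfold_concepts(theorem)  # clarify the task before reasoning
    return [executor(s) for s in planner(enriched)]
```

The key design point the paper argues for is that `unfold_concepts` runs *before* any reasoning: the model sees definitions of the concepts in the goal rather than bare names.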
Related papers
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning [31.85615810584119]
We introduce Structure of Thought (SoT), a prompting technique that guides models to construct intermediate text structures. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models.
arXiv Detail & Related papers (2026-03-04T07:05:09Z) - CL4SE: A Context Learning Benchmark For Software Engineering Tasks [7.899464362501583]
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks. Existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the effects of different contexts. We propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types. We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics.
arXiv Detail & Related papers (2026-02-26T14:28:57Z) - Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO [24.91321958525287]
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
arXiv Detail & Related papers (2026-02-05T05:27:11Z) - CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks [53.88194225946438]
Chain-of-Thought for Detection (CoT4Det) is a simple but efficient strategy that reformulates perception tasks into three interpretable steps. We show that CoT4Det significantly improves perception performance without compromising general vision language capabilities.
arXiv Detail & Related papers (2025-12-07T05:26:30Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure. We restructure the training data to enforce a new output form, providing new annotations for existing datasets. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task [49.27354010985993]
Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities. Existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. We propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions.
arXiv Detail & Related papers (2025-09-29T08:10:29Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Syntactic Control of Language Models by Posterior Inference [53.823006836309695]
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability. We argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures each generated token aligns with the desired syntactic structure.
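The propose-weight-resample loop that this abstract describes can be illustrated with a toy sequential Monte Carlo sampler. The four-word vocabulary, the tag lookup standing in for a syntactic tagger, and the uniform proposal are all hypothetical stand-ins, not the paper's components.

```python
import random

# Toy SMC-style constrained generation: propose tokens, reweight by
# agreement with a target tag sequence, resample. All data is hypothetical.
random.seed(0)
VOCAB = {"the": "DT", "cat": "NN", "sat": "VB", "mat": "NN"}
TARGET = ["DT", "NN", "VB"]  # desired tag sequence (constituency stand-in)

def proposal(prefix):
    """Proposal distribution: here, uniform over the toy vocabulary."""
    return random.choice(list(VOCAB))

def weight(token, position):
    """Tagger-based reweighting: reward tokens matching the target tag."""
    return 1.0 if VOCAB[token] == TARGET[position] else 1e-6

def smc(num_particles=64):
    particles = [([], 1.0) for _ in range(num_particles)]
    for pos in range(len(TARGET)):
        # Extend each particle by one proposed token and reweight it.
        particles = [(seq + [tok], w * weight(tok, pos))
                     for seq, w in particles
                     for tok in [proposal(seq)]]
        if pos < len(TARGET) - 1:
            # Multinomial resampling proportional to weight, then reset.
            total = sum(w for _, w in particles)
            seqs = random.choices([s for s, _ in particles],
                                  weights=[w / total for _, w in particles],
                                  k=num_particles)
            particles = [(s, 1.0) for s in seqs]
    return max(particles, key=lambda p: p[1])[0]
```

A real system would replace `proposal` with language-model sampling and `weight` with an incremental syntactic tagger; the resampling step is what steers generation toward the target structure without modifying the model itself.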
arXiv Detail & Related papers (2025-06-08T14:01:34Z) - Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks [0.7988085110283119]
Recent results on the Corr2Cause benchmark reveal that state-of-the-art LLMs only marginally outperform random baselines. We give the model the capability to structure its thinking by guiding it to build a structured knowledge graph. Experiments on the test subset of the Corr2Cause benchmark with the Qwen3-32B model (a reasoning model) show substantial gains over standard direct prompting methods.
arXiv Detail & Related papers (2025-05-23T15:37:40Z) - Protein Structure Tokenization: Benchmarking and New Recipe [16.842453216446987]
StructTokenBench is a framework that comprehensively evaluates the quality and efficiency of structure tokenizers. AminoAseed is a strategy that enhances codebook updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively.
arXiv Detail & Related papers (2025-02-28T15:14:33Z) - EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs) on equivalence checking: determining whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models [61.467781476005435]
skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain.
We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales.
Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
arXiv Detail & Related papers (2024-10-17T17:51:40Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
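The core move in the summary above, instructing an LLM to turn a compact triplet into a context-rich passage, amounts to a prompt-construction step. The sketch below is an assumed illustration: the prompt wording, function names, and the stub standing in for the LLM call are not taken from the paper.

```python
def triplet_to_prompt(head, relation, tail):
    """Build an instruction asking an LLM to verbalize a KG triplet
    (the prompt wording is an illustrative assumption)."""
    return (
        "Rewrite the knowledge-graph triplet as a short descriptive "
        f"paragraph: ({head}, {relation}, {tail})"
    )

def contextualize(triplet, llm=lambda p: p.split(": ")[-1]):
    """Plug-in-and-play step: the default `llm` is an echo stub; a real
    system would query a language model with the prompt."""
    return llm(triplet_to_prompt(*triplet))
```

The resulting context-rich segments would then be fed to whichever discriminative or generative KGC model is being trained, which is what makes the strategy framework-agnostic.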
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution [36.30158138035512]
We present a benchmark consisting of 800 Python functions (3-13 lines).
Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction.
We show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it.
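The two tasks described above can be illustrated on a made-up function (not an actual benchmark item): output prediction asks the model to execute the code mentally, while input prediction asks it to invert the code.

```python
# Toy illustration of CRUXEval's two task types (hypothetical example).

def f(xs):
    return sorted(set(xs))[:2]

# Output prediction: given f and an input, predict f(input).
assert f([3, 1, 3, 2]) == [1, 2]

# Input prediction: given f and an output, supply any input producing it.
candidate_input = [2, 1]
assert f(candidate_input) == [1, 2]
```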
arXiv Detail & Related papers (2024-01-05T20:53:51Z) - An In-Context Learning Agent for Formal Theorem-Proving [10.657173216834668]
We present an in-context learning agent for formal theorem proving in environments like Lean and Coq.
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
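The search loop this summary describes, an LLM proposing tactics inside a stateful backtracking search, can be sketched schematically. The toy proof environment, the tactic names, and the stub replacing the LLM query are illustrative assumptions, not COPRA's actual interface.

```python
# Schematic LLM-guided backtracking proof search in the spirit of COPRA;
# the proof state, tactics, and "LLM" are toy stand-ins, not the real system.

GOALS = {"A /\\ B": ["A", "B"], "A": [], "B": []}  # goal -> subgoals

def propose_tactics(goal):
    """Stand-in for querying an LLM for candidate tactic applications."""
    return ["split"] if "/\\" in goal else ["assumption", "auto"]

def apply_tactic(goal, tactic):
    """Toy kernel: 'split' opens subgoals; closing tactics succeed on atoms."""
    if tactic == "split" and "/\\" in goal:
        return GOALS[goal]
    if tactic in ("assumption", "auto") and goal in GOALS and not GOALS[goal]:
        return []
    return None  # tactic failed

def search(goal, depth=8):
    """Depth-first backtracking over LLM-proposed tactics."""
    if depth == 0:
        return None
    for tactic in propose_tactics(goal):
        subgoals = apply_tactic(goal, tactic)
        if subgoals is None:
            continue  # backtrack: try the next proposed tactic
        subproofs = [search(g, depth - 1) for g in subgoals]
        if all(p is not None for p in subproofs):
            return [tactic] + [t for p in subproofs for t in p]
    return None
```

The statefulness the abstract emphasizes lives in the recursion: each failed tactic is discarded and the search resumes from the last valid proof state rather than restarting from scratch.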
arXiv Detail & Related papers (2023-10-06T16:21:22Z) - KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation [19.31783654838732]
We use large language models to generate coherent descriptions, bridging the semantic gap between queries and answers. We also utilize inverse relations to create a symmetric graph, thereby providing augmented training samples for KGC. Our approach achieves a 4.2% improvement in Hit@1 on WN18RR and a 3.4% improvement in Hit@3 on FB15k-237, demonstrating superior performance.
arXiv Detail & Related papers (2023-09-26T09:03:25Z) - Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z) - NLP-IIS@UT at SemEval-2021 Task 4: Machine Reading Comprehension using the Long Document Transformer [8.645929825516816]
This paper presents a technical report of our submission to SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. In this task, we predict the correct answer to a question given its context.
To tackle this problem, we used the Longformer model to better process the sequences.
arXiv Detail & Related papers (2021-05-08T20:48:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.