Clarifying Before Reasoning: A Coq Prover with Structural Context
- URL: http://arxiv.org/abs/2507.02541v1
- Date: Thu, 03 Jul 2025 11:35:34 GMT
- Title: Clarifying Before Reasoning: A Coq Prover with Structural Context
- Authors: Yanzhen Lu, Hanbin Yang, Xiaodie Wang, Ge Zhang, Biao Li, Chenxu Fu, Chao Li, Yang Yuan, Andrew Chi-Chih Yao,
- Abstract summary: We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context leads to a 1.85$\times$ improvement in clarity score. We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages.
- Score: 13.273599284897411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate whether improving task clarity can enhance the reasoning ability of large language models, focusing on theorem proving in Coq. We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context to the standard input used by modern LLMs leads to a 1.85$\times$ improvement in clarity score (44.5\%~$\rightarrow$~82.3\%). Using the general-purpose model \texttt{DeepSeek-V3}, our approach leads to a 2.1$\times$ improvement in proof success (21.8\%~$\rightarrow$~45.8\%) and outperforms the previous state-of-the-art \texttt{Graph2Tac} (33.2\%). We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages, following the same evaluation protocol as \texttt{Graph2Tac}. Furthermore, fine-tuning smaller models on our structured data achieves even higher performance (48.6\%). Our method uses selective concept unfolding to enrich task descriptions and employs a Planner--Executor architecture. These findings highlight the value of structured task representations in bridging the gap between understanding and reasoning.
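The abstract's pipeline, enriching a theorem statement by selectively unfolding the concepts it mentions, then feeding the result through a Planner and an Executor, can be sketched roughly as below. The function names, the toy concept table, and the stubbed-out LLM calls are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a Planner-Executor loop with selective concept
# unfolding (names, data, and LLM stubs are hypothetical, not the paper's code).

CONCEPT_DEFS = {  # toy table: concept name -> Coq-style definition
    "injective": "fun f => forall x y, f x = f y -> x = y",
    "surjective": "fun f => forall y, exists x, f x = y",
}

def unfold_concepts(theorem: str) -> str:
    """Selectively append definitions of concepts that occur in the goal."""
    notes = [f"{c} := {d}" for c, d in CONCEPT_DEFS.items() if c in theorem]
    return theorem + ("\nwhere " + "; ".join(notes) if notes else "")

def planner(task: str) -> list[str]:
    """Stand-in for an LLM call that breaks the enriched task into steps."""
    return [f"step: prove '{task.splitlines()[0]}'"]

def executor(step: str) -> str:
    """Stand-in for an LLM call that emits a Coq tactic for one step."""
    return "intros; auto."

def prove(theorem: str) -> list[str]:
    enriched = unfold_concepts(theorem)  # clarify the task before reasoning
    return [executor(s) for s in planner(enriched)]
```

The key design point the paper argues for is that `unfold_concepts` runs *before* any reasoning: the model sees definitions of the concepts in the goal rather than bare names.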
Related papers
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning [31.85615810584119]
We introduce Structure of Thought (SoT), a prompting technique that guides models to construct intermediate text structures. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models.
arXiv Detail & Related papers (2026-03-04T07:05:09Z) - CL4SE: A Context Learning Benchmark For Software Engineering Tasks [7.899464362501583]
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks. Existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the effects of different contexts. We propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types. We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics.
arXiv Detail & Related papers (2026-02-26T14:28:57Z) - Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO [24.91321958525287]
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
arXiv Detail & Related papers (2026-02-05T05:27:11Z) - CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks [53.88194225946438]
Chain-of-Thought for Detection (CoT4Det) is a simple but efficient strategy that reformulates perception tasks into three interpretable steps. We show that CoT4Det significantly improves perception performance without compromising general vision language capabilities.
arXiv Detail & Related papers (2025-12-07T05:26:30Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure. We restructure the training data to enforce a new output form, providing new annotations for existing datasets. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task [49.27354010985993]
Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities. Existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. We propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions.
arXiv Detail & Related papers (2025-09-29T08:10:29Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Syntactic Control of Language Models by Posterior Inference [53.823006836309695]
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability. We argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures each generated token aligns with the desired syntactic structure.
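The propose-weight-resample loop that this abstract describes can be illustrated with a toy sequential Monte Carlo sampler. The four-word vocabulary, the tag lookup standing in for a syntactic tagger, and the uniform proposal are all hypothetical stand-ins, not the paper's components.

```python
import random

# Toy SMC-style constrained generation: propose tokens, reweight by
# agreement with a target tag sequence, resample. All data is hypothetical.
random.seed(0)
VOCAB = {"the": "DT", "cat": "NN", "sat": "VB", "mat": "NN"}
TARGET = ["DT", "NN", "VB"]  # desired tag sequence (constituency stand-in)

def proposal(prefix):
    """Proposal distribution: here, uniform over the toy vocabulary."""
    return random.choice(list(VOCAB))

def weight(token, position):
    """Tagger-based reweighting: reward tokens matching the target tag."""
    return 1.0 if VOCAB[token] == TARGET[position] else 1e-6

def smc(num_particles=64):
    particles = [([], 1.0) for _ in range(num_particles)]
    for pos in range(len(TARGET)):
        # Extend each particle by one proposed token and reweight it.
        particles = [(seq + [tok], w * weight(tok, pos))
                     for seq, w in particles
                     for tok in [proposal(seq)]]
        if pos < len(TARGET) - 1:
            # Multinomial resampling proportional to weight, then reset.
            total = sum(w for _, w in particles)
            seqs = random.choices([s for s, _ in particles],
                                  weights=[w / total for _, w in particles],
                                  k=num_particles)
            particles = [(s, 1.0) for s in seqs]
    return max(particles, key=lambda p: p[1])[0]
```

A real system would replace `proposal` with language-model sampling and `weight` with an incremental syntactic tagger; the resampling step is what steers generation toward the target structure without modifying the model itself.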
arXiv Detail & Related papers (2025-06-08T14:01:34Z) - Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks [0.7988085110283119]
Recent results on the Corr2Cause benchmark reveal that state-of-the-art LLMs only marginally outperform random baselines. We give the model the capability to structure its thinking by guiding it to build a structured knowledge graph. Experiments on the test subset of the Corr2Cause benchmark with the Qwen3-32B model (a reasoning model) show substantial gains over standard direct prompting methods.
arXiv Detail & Related papers (2025-05-23T15:37:40Z) - Protein Structure Tokenization: Benchmarking and New Recipe [16.842453216446987]
StructTokenBench is a framework that comprehensively evaluates the quality and efficiency of structure tokenizers. AminoAseed is a strategy that enhances codebook updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively.
arXiv Detail & Related papers (2025-02-28T15:14:33Z) - EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs) on equivalence checking: determining whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models [61.467781476005435]
skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain.
We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales.
Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
arXiv Detail & Related papers (2024-10-17T17:51:40Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
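The core move in the summary above, instructing an LLM to turn a compact triplet into a context-rich passage, amounts to a prompt-construction step. The sketch below is an assumed illustration: the prompt wording, function names, and the stub standing in for the LLM call are not taken from the paper.

```python
def triplet_to_prompt(head, relation, tail):
    """Build an instruction asking an LLM to verbalize a KG triplet
    (the prompt wording is an illustrative assumption)."""
    return (
        "Rewrite the knowledge-graph triplet as a short descriptive "
        f"paragraph: ({head}, {relation}, {tail})"
    )

def contextualize(triplet, llm=lambda p: p.split(": ")[-1]):
    """Plug-in-and-play step: the default `llm` is an echo stub; a real
    system would query a language model with the prompt."""
    return llm(triplet_to_prompt(*triplet))
```

The resulting context-rich segments would then be fed to whichever discriminative or generative KGC model is being trained, which is what makes the strategy framework-agnostic.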
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution [36.30158138035512]
We present a benchmark consisting of 800 Python functions (3-13 lines).
Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction.
We show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it.
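The two tasks described above can be illustrated on a made-up function (not an actual benchmark item): output prediction asks the model to execute the code mentally, while input prediction asks it to invert the code.

```python
# Toy illustration of CRUXEval's two task types (hypothetical example).

def f(xs):
    return sorted(set(xs))[:2]

# Output prediction: given f and an input, predict f(input).
assert f([3, 1, 3, 2]) == [1, 2]

# Input prediction: given f and an output, supply any input producing it.
candidate_input = [2, 1]
assert f(candidate_input) == [1, 2]
```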
arXiv Detail & Related papers (2024-01-05T20:53:51Z) - An In-Context Learning Agent for Formal Theorem-Proving [10.657173216834668]
We present an in-context learning agent for formal theorem proving in environments like Lean and Coq.
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
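The search loop this summary describes, an LLM proposing tactics inside a stateful backtracking search, can be sketched schematically. The toy proof environment, the tactic names, and the stub replacing the LLM query are illustrative assumptions, not COPRA's actual interface.

```python
# Schematic LLM-guided backtracking proof search in the spirit of COPRA;
# the proof state, tactics, and "LLM" are toy stand-ins, not the real system.

GOALS = {"A /\\ B": ["A", "B"], "A": [], "B": []}  # goal -> subgoals

def propose_tactics(goal):
    """Stand-in for querying an LLM for candidate tactic applications."""
    return ["split"] if "/\\" in goal else ["assumption", "auto"]

def apply_tactic(goal, tactic):
    """Toy kernel: 'split' opens subgoals; closing tactics succeed on atoms."""
    if tactic == "split" and "/\\" in goal:
        return GOALS[goal]
    if tactic in ("assumption", "auto") and goal in GOALS and not GOALS[goal]:
        return []
    return None  # tactic failed

def search(goal, depth=8):
    """Depth-first backtracking over LLM-proposed tactics."""
    if depth == 0:
        return None
    for tactic in propose_tactics(goal):
        subgoals = apply_tactic(goal, tactic)
        if subgoals is None:
            continue  # backtrack: try the next proposed tactic
        subproofs = [search(g, depth - 1) for g in subgoals]
        if all(p is not None for p in subproofs):
            return [tactic] + [t for p in subproofs for t in p]
    return None
```

The statefulness the abstract emphasizes lives in the recursion: each failed tactic is discarded and the search resumes from the last valid proof state rather than restarting from scratch.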
arXiv Detail & Related papers (2023-10-06T16:21:22Z) - KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation [19.31783654838732]
We use large language models to generate coherent descriptions, bridging the semantic gap between queries and answers. We also utilize inverse relations to create a symmetric graph, thereby providing augmented training samples for KGC. Our approach achieves a 4.2% improvement in Hit@1 on WN18RR and a 3.4% improvement in Hit@3 on FB15k-237, demonstrating superior performance.
arXiv Detail & Related papers (2023-09-26T09:03:25Z) - Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z) - NLP-IIS@UT at SemEval-2021 Task 4: Machine Reading Comprehension using the Long Document Transformer [8.645929825516816]
This paper presents a technical report of our submission to SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. In this task, we predict the correct answer to a question given its context.
To tackle this problem, we used the Longformer model to better process the sequences.
arXiv Detail & Related papers (2021-05-08T20:48:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.