Related papers: NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

URL: http://arxiv.org/abs/2603.02504v2
Date: Wed, 04 Mar 2026 13:59:59 GMT
Title: NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect
Authors: Pratibha Zunjare, Michael Hsiao,
Abstract summary: Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning.<n>We present textbfNeuroProlog, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs.
Score: 0.12277343096128711
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, $p < 0.01$), +3.43\% (GPT-OSS-20B, $p < 0.01$), and +5.54\% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.

Related papers

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk.<n>We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z)
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression (REC) aims to localize the image region corresponding to a natural-language query.<n>Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning.<n>We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
arXiv Detail & Related papers (2026-01-19T07:21:19Z)
AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent [80.83250816918861]
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought.<n>However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations.<n>We present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision.
arXiv Detail & Related papers (2025-12-23T19:57:49Z)
Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
Training Language Models to Use Prolog as a Tool [2.4305775926851334]
We fine-tune language models to use Prolog as an external tool for verifiable computation.<n>Our results show that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications.
arXiv Detail & Related papers (2025-12-08T10:39:38Z)
Teaching Language Models to Reason with Tools [73.21700643314917]
We present emphHint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths.<n>CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model.
arXiv Detail & Related papers (2025-10-23T08:41:44Z)
Automated Repair of C Programs Using Large Language Models [0.0]
This study explores the potential of Large Language Models (LLMs) in automating the repair of C programs.<n>We present a framework that integrates spectrum-based fault localization (SBFL), runtime feedback, and Chain-of-Thought-structured prompting into an autonomous repair loop.<n>Our approach achieves 44.93% repair accuracy, representing a 3.61% absolute improvement over strong state-of-the-art APR baselines.
arXiv Detail & Related papers (2025-09-02T04:34:11Z)
Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving [9.550050299909184]
We present a new approach that integrates Chain-of-Thought (CoT) with formal language.<n>The model interleaves natural language reasoning with incremental emission of solver-executable code.<n>Built on Qwen2.5-VL-7B, our new model, GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks.
arXiv Detail & Related papers (2025-08-12T17:26:23Z)
Syntactic Control of Language Models by Posterior Inference [53.823006836309695]
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability.<n>We argue that sampling algorithms based on the posterior inference can effectively enforce a target constituency structure during generation.<n>Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure.
arXiv Detail & Related papers (2025-06-08T14:01:34Z)
OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents [8.441638148384389]
We introduce OptimAI, a framework for solving Optimization problems described in natural language.<n>Our framework is built upon the following key roles: formulator, planner, coder and code critic.<n>Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.
arXiv Detail & Related papers (2025-04-23T17:45:05Z)
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors. We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)<n>RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation.<n>Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier. Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.