Related papers: Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback

Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback

URL: http://arxiv.org/abs/2412.06827v1
Date: Fri, 06 Dec 2024 21:17:47 GMT
Title: Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback
Authors: Avinash Anand, Kritarth Prasad, Chhavi Kirtani, Ashwin R Nair, Mohit Gupta, Saloni Garg, Anurag Gautam, Snehal Buldeo, Rajiv Ratn Shah,
Abstract summary: Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems.<n>This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF)
Score: 33.000541253136745
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems, particularly in advanced arithmetic and conceptual understanding. While some research has explored ways to enhance LLMs in physics education using techniques such as prompt engineering and Retrieval Augmentation Generation (RAG), not enough effort has been made in addressing their limitations in physics reasoning. This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF). We evaluate several reinforcement learning methods, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Remax optimization. These methods are chosen to investigate RL policy performance with different settings on the PhyQA dataset, which includes challenging physics problems from high school textbooks. Our RLHAIF model, tested on leading LLMs like LLaMA2 and Mistral, achieved superior results, notably with the MISTRAL-PPO model, demonstrating marked improvements in reasoning and accuracy. It achieved high scores, with a 58.67 METEOR score and a 0.74 Reasoning score, making it a strong example for future physics reasoning research in this area.

Related papers

PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems [3.0901186959880977]
We evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive.<n>We introduce a new evaluation benchmark for physics problems, $rm Psmall HYSICSEsmall VAL$, consisting of 19,609 problems sourced from various physics textbooks.
arXiv Detail & Related papers (2025-07-31T18:12:51Z)
Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature.<n>This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolbox.<n>We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z)
Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment [0.0]
Large language models (LLMs) are now widely accessible, reaching learners at all educational levels.<n>This study compares the problem-solving performance of a general-purpose LLM (GPT-4o) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad.
arXiv Detail & Related papers (2025-05-14T14:46:32Z)
Training Large Language Models to Reason via EM Policy Gradient [0.27195102129094995]
We introduce an off-policy reinforcement learning algorithm, EM Policy Gradient, to enhance LLM reasoning. We evaluate the effectiveness of EM Policy Gradient on the GSM8K and MATH (HARD) datasets. Models fine-tuned with our method exhibit cognitive behaviors, such as sub-problem decomposition, self-verification, and backtracking.
arXiv Detail & Related papers (2025-04-24T01:31:05Z)
A Survey on Mathematical Reasoning and Optimization with Large Language Models [0.5439020425819]
Recent advancements in Large Language Models (LLMs) have significantly improved AI-driven mathematical reasoning, theorem proving, and optimization techniques. This survey explores the evolution of mathematical problem-solving in AI, from early statistical learning approaches to modern deep learning and transformer-based methodologies.
arXiv Detail & Related papers (2025-03-22T10:49:32Z)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents [27.112239616508834]
Mixture of Refinement Agents (MoRA) is a novel agentic refinement framework for large language models (LLMs)<n>MoRA iteratively refines the LLM generated base solution by correcting the aforementioned errors, resulting in a significant performance improvement for open-source LLMs.<n>We evaluate our approach on the SciEval and MMLU subsets along with our own physics dataset (PhysicsQA)
arXiv Detail & Related papers (2024-12-01T14:15:55Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs [12.48241058167222]
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions. But studies reveal that they often struggle with tasks requiring reasoning, such as math or physics limitation. This raises questions about whether LLMs truly comprehend embedded knowledge or merely learn to replicate the token distribution without a true understanding of the content. We propose Decon Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T13:17:09Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering [32.87943023416162]
We propose an LMM-based model to answer multimodal physics MCQs. For domain adaptation, we utilize the MM-PhyQA dataset comprising Indian high school-level multimodal physics problems. In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image processing errors.
arXiv Detail & Related papers (2024-04-19T14:52:57Z)
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [54.19942426544731]
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains. This paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs.
arXiv Detail & Related papers (2023-12-01T16:00:25Z)
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce an expansive benchmark suite SciBench for Large Language Model (LLM) SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.