Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?
- URL: http://arxiv.org/abs/2507.09638v1
- Date: Sun, 13 Jul 2025 14:05:48 GMT
- Title: Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?
- Authors: Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn, Attapol Thamrongrattanarit-Rutherford, Ekapol Chuangsuwanich, Sarana Nutanong
- Abstract summary: We introduce an approach aligning Thai legal question answering systems with improved law citation accuracy and better response quality. Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward. Experiments on the NitiBench benchmark demonstrate substantial improvements.
- Score: 7.42457277619017
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach that aligns LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains over the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
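The two quantities the abstract names, citation F1 and a group-relative training signal, can be illustrated with a minimal sketch. This is not the paper's implementation: the set-based F1, the population standard deviation, the `eps` constant, and the section names are assumptions made for illustration, and the BGE-M3 semantic-similarity term of the reward is left out entirely.

```python
from statistics import mean, pstdev


def citation_f1(predicted, gold):
    """Set-based F1 between predicted and gold law-section citations."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical gold citations and three sampled answers for one question.
gold = ["Section 1", "Section 2"]
rollouts = [["Section 1", "Section 2"], ["Section 1"], ["Section 9"]]

rewards = [citation_f1(r, gold) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Here the fully correct rollout receives the largest advantage and the rollout with no matching citation the smallest; because advantages are computed within the group, no separate value model is needed to provide a baseline.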
Related papers
- WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning [67.45237332694025]
Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
arXiv Detail & Related papers (2026-02-19T02:43:35Z)
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z)
- LegalMALR: Multi-Agent Query Understanding and LLM-Based Reranking for Chinese Statute Retrieval [10.997604609194033]
Statute retrieval is essential for legal assistance and judicial decision support. Real-world legal queries are often implicit, multi-issue, and expressed in colloquial or underspecified forms. We present LegalMALR, a retrieval framework that integrates a Multi-Agent Query Understanding System with a zero-shot LLM-based reranking module.
arXiv Detail & Related papers (2026-01-25T04:44:56Z)
- PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z)
- ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India [10.522785783474857]
We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback. Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization.
arXiv Detail & Related papers (2025-12-19T19:13:41Z)
- ASVRI-Legal: Fine-Tuning LLMs with Retrieval Augmented Generation for Enhanced Legal Regulation [3.173215823388563]
This study explores the fine-tuning of Large Language Models (LLMs) to better support policymakers in their work of understanding, analyzing, and crafting legal regulations. To equip the model with a deep understanding of legal texts, we curated a supervised dataset tailored to the specific needs of the legal domain. This combination of fine-tuning and RAG-based augmentation results in a tool that not only processes legal information but actively assists policymakers in interpreting regulations and drafting new ones that align with current needs.
arXiv Detail & Related papers (2025-11-05T15:45:52Z)
- Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference [15.567885200167913]
We introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost. It tackles three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization.
arXiv Detail & Related papers (2025-10-11T07:17:22Z)
- GLARE: Agentic Reasoning for Legal Judgment Prediction [60.13483016810707]
Legal judgment prediction (LJP) has become increasingly important in the legal field. Existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. We introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules.
arXiv Detail & Related papers (2025-08-22T13:38:12Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z)
- AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation [5.732271982985626]
AutoLaw is a novel violation detection framework for domain-specific large language models (LLMs). It combines adversarial data generation with a jury-inspired deliberation process to enhance legal compliance of LLMs. Our results highlight the framework's ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments.
arXiv Detail & Related papers (2025-05-20T07:09:13Z)
- DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [55.06360285372418]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.
arXiv Detail & Related papers (2025-05-18T11:08:32Z)
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
- NitiBench: A Comprehensive Study of LLM Framework Capabilities for Thai Legal Question Answering [4.61348190872483]
This paper introduces NitiBench, a benchmark comprising two datasets: the NitiBench-CCL, covering general Thai financial law, and the NitiBench-Tax, which includes real-world tax law cases. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions.
arXiv Detail & Related papers (2025-02-15T17:52:14Z)
- Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study [9.30538764385435]
Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations. We then conduct a systematic benchmarking across a range of solutions. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero.
arXiv Detail & Related papers (2024-12-09T07:46:14Z)
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought [61.588465852846646]
Chain-of-Thought (CoT) reasoning has emerged as a promising approach for enhancing the performance of large language models (LLMs).
In this work, we introduce a novel reasoning boundary framework (RBF) to address these challenges.
arXiv Detail & Related papers (2024-10-08T05:26:28Z)
- Enabling Discriminative Reasoning in LLMs for Legal Judgment Prediction [23.046342240176575]
We introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspired by human reasoning.
ADAPT involves decomposing case facts, discriminating among potential charges, and predicting the final judgment.
Experiments conducted on two widely-used datasets demonstrate the superior performance of our framework in legal judgment prediction.
arXiv Detail & Related papers (2024-07-02T05:43:15Z)
- Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence [5.07013500385659]
This paper explores Large Language Models' (LLMs) capabilities in applying tax law.
Our experiments demonstrate emerging legal understanding capabilities, with improved performance in each subsequent OpenAI model release.
Findings indicate that LLMs, particularly when combined with prompting enhancements and the correct legal texts, can perform at high levels of accuracy but not yet at expert tax lawyer levels.
arXiv Detail & Related papers (2023-06-12T12:40:48Z)