Related papers: Making Qwen3 Think in Korean with Reinforcement Learning

URL: http://arxiv.org/abs/2508.10355v1
Date: Thu, 14 Aug 2025 05:49:34 GMT
Title: Making Qwen3 Think in Korean with Reinforcement Learning
Authors: Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Abstract summary: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization algorithm.
Score: 5.237306053045462
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training, such as reward hacking and policy collapse, by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
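
For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage step that standard GRPO is built around: several completions are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. This is not the authors' customized variant, and the oracle judge calibration is not shown because the abstract gives no details of it. The function name, the `eps` term, the use of the population standard deviation, and the sample rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own sampling group.

    rewards: scalar rewards for completions sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero when
         every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and those below receive negative ones. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal, which is one reason the quality of the reward signal matters for training stability.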

Related papers

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction [7.756650000650388]
We investigate whether reinforcement learning can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. We show that aligning the model's internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL's effectiveness.
arXiv Detail & Related papers (2026-01-09T01:17:31Z)
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models. But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z)
Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective [52.38531288378491]
Reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs). In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration.
arXiv Detail & Related papers (2025-09-26T17:39:48Z)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning [21.36638095182274]
Reinforcement learning can sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering". We show that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.
arXiv Detail & Related papers (2025-04-22T13:41:26Z)
Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models [0.0]
We introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends.
arXiv Detail & Related papers (2025-03-29T04:17:58Z)
Multi-Step Reasoning in Korean and the Emergent Mirage [0.0]
We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Our experiments reveal that models trained on fewer than $2 \cdot 10^{25}$ training FLOPs struggle to solve any questions, showing near-zero performance.
arXiv Detail & Related papers (2025-01-10T05:07:27Z)
RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline. Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.