Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks
- URL: http://arxiv.org/abs/2503.13988v1
- Date: Tue, 18 Mar 2025 07:44:49 GMT
- Title: Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks
- Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya, Nataliia Komleva
- Abstract summary: Small or compact models, though more efficient, often lack sufficient support for underrepresented languages. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks. The proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training in low-resource settings with inaccessible infrastructure. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, while offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model by guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at https://github.com/NLPForUA/ZNO.
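For orientation, the following is a minimal, hypothetical sketch of the kind of parameter-efficient (LoRA) chain-of-thought tuning and adapter merging the abstract describes, assuming the Hugging Face transformers, peft, and datasets libraries. The base model name, dataset file, field names, prompt template, and hyperparameters are illustrative assumptions rather than the authors' exact setup; the released code and tuned models are at https://github.com/NLPForUA/ZNO.

```python
# Hypothetical sketch (not the authors' code): LoRA chain-of-thought tuning of a
# compact model on ZNO-style exam tasks, followed by merging the adapter.
# Model name, dataset fields, prompt template, and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-3B-Instruct"  # any of the compact models from the abstract
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# LoRA adapter on the attention projections: tens of millions of trainable
# parameters, in line with the 20-50 million range reported in the abstract.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))
model.print_trainable_parameters()

def to_features(example):
    # Joint "topic + step-by-step solution + answer" target, mirroring the
    # proposed tuning variant; the field names are assumed, not from the paper.
    text = (f"Task: {example['question']}\n"
            f"Topic: {example['topic']}\n"
            f"Solution: {example['solution']}\n"
            f"Answer: {example['answer']}")
    return tokenizer(text, truncation=True, max_length=2048)

train = load_dataset("json", data_files="zno_cot_train.jsonl")["train"].map(to_features)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama32-3b-zno-lora", num_train_epochs=3,
                           per_device_train_batch_size=2, gradient_accumulation_steps=8,
                           learning_rate=2e-4, bf16=True, logging_steps=20),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
).train()

# Merge the adapter into the base weights for standalone inference; the paper
# additionally studies how merging a quantized adapter affects generation quality.
model.merge_and_unload().save_pretrained("llama32-3b-zno-lora-merged")
```

A configuration of roughly this size fits on a single A100 and matches the trainable-parameter budget mentioned in the abstract; the actual prompt format, topic annotations, and quantization settings should be taken from the authors' repository.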
Related papers
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation.
S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting [7.146498833443095]
Concept Distillation (CD) is an automatic prompt optimization technique for enhancing weaker models on complex tasks.
CD involves: (1) collecting mistakes made by weak models with a base prompt (initialization), (2) using a strong model to generate reasons for these mistakes and create rules/concepts for weak models (induction), and (3) filtering these rules based on validation set performance.
We evaluated CD on NL2Code and mathematical reasoning tasks, observing significant performance boosts for small and weaker language models.
arXiv Detail & Related papers (2024-08-18T05:37:48Z)
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases [46.997172696192195]
This paper addresses the need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns.
We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment.
arXiv Detail & Related papers (2024-02-22T18:58:55Z)
- Large language models for aspect-based sentiment analysis [0.0]
We assess the performance of GPT-4 and GPT-3.5 in zero-shot, few-shot, and fine-tuned settings.
Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task.
arXiv Detail & Related papers (2023-10-27T10:03:21Z)
- Text Alignment Is An Efficient Unified Model for Massive NLP Tasks [24.069447197357164]
Next-word prediction is often not an efficient formulation for many NLP tasks.
We propose text alignment as an efficient unified model for a wide range of crucial tasks.
Our model delivers on par or even superior performance with much smaller model sizes.
arXiv Detail & Related papers (2023-07-06T02:28:31Z)
- Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance [35.38549845444575]
This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-05-26T23:46:42Z)
- The False Promise of Imitating Proprietary LLMs [158.65692029352584]
An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model.
This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model.
We first finetune a series of LMs that imitate ChatGPT using varying base model sizes.
We then evaluate the models using crowd raters and canonical NLP benchmarks.
arXiv Detail & Related papers (2023-05-25T05:00:12Z)
- Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks [77.90900650816046]
We introduce Zemi, a zero-shot semi-parametric language model.
We train Zemi with a novel semi-parametric multitask prompted training paradigm.
Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus.
arXiv Detail & Related papers (2022-10-01T04:08:50Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.