PLaMo 2 Technical Report
- URL: http://arxiv.org/abs/2509.04897v2
- Date: Thu, 25 Sep 2025 06:33:51 GMT
- Title: PLaMo 2 Technical Report
- Authors: Preferred Networks: Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Toru Ogawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Kuniyuki Takahashi, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu
- Abstract summary: We introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture. PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
- Score: 9.166942912957724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
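Post-training refines the models with supervised fine-tuning followed by DPO. As background, here is a minimal sketch of the standard DPO objective in PyTorch; this is a generic illustration, not PLaMo 2's training code, and the `beta` value and dummy log-probabilities are invented:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: make the policy prefer the chosen response
    over the rejected one relative to a frozen reference model.
    All inputs are summed log-probabilities of full responses, shape (batch,).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * x)) == softplus(-beta * x)
    return F.softplus(-beta * (chosen_margin - rejected_margin)).mean()

# Toy call with invented log-probabilities
print(dpo_loss(torch.tensor([-12.3]), torch.tensor([-14.0]),
               torch.tensor([-13.0]), torch.tensor([-13.5])))
```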
Related papers
- Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes [7.998815625852598]
We introduce a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations. We detail a robust Reinforcement Learning Fine-Tuning pipeline that stabilizes training via difficulty-aware data filtering (a generic sketch follows this entry) and mixed-policy trajectory reuse.
arXiv Detail & Related papers (2025-12-11T00:51:18Z)
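Difficulty-aware data filtering of the kind named above is commonly implemented by estimating a per-prompt solve rate from sampled rollouts and keeping only prompts the policy sometimes, but not always, solves. The sketch below shows that generic pattern under assumptions: `solve` and the thresholds are placeholders, not the Motif-2 recipe.

```python
import random

def estimate_solve_rate(prompt, solve, n_samples=8):
    """Fraction of sampled rollouts that solve the prompt.
    `solve` stands in for sampling a rollout and verifying it."""
    return sum(solve(prompt) for _ in range(n_samples)) / n_samples

def filter_by_difficulty(prompts, solve, low=0.1, high=0.9):
    """Keep prompts with solve rate strictly inside (low, high):
    always-solved and never-solved prompts carry little RL signal."""
    return [p for p in prompts
            if low < estimate_solve_rate(p, solve) < high]

# Toy usage: a fake solver succeeding with a per-prompt probability
prompts = [{"id": i, "p_success": i / 10} for i in range(11)]
fake_solve = lambda p: random.random() < p["p_success"]
print([p["id"] for p in filter_by_difficulty(prompts, fake_solve)])
```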
- Improving LLMs for Machine Translation Using Synthetic Preference Data [0.0]
We explore how a general instruction-tuned large language model can be improved for machine translation using relatively few, easily produced data resources. We improve the Slovene large language model GaMS-9B-Instruct using Direct Preference Optimization (DPO), generating its training data by translating English Wikipedia articles with two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct (a sketch of this pairing scheme follows this entry). Compared to the baseline, the fine-tuned model achieved COMET score gains of around 0.04 and 0.02 on translating Wikipedia articles.
arXiv Detail & Related papers (2025-08-20T14:24:16Z)
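A common way to build such synthetic preference data is to translate each source with both models and let an automatic quality metric decide which output counts as "chosen". The sketch below shows that generic pattern; `translate_a`, `translate_b`, and `quality_score` are placeholder callables, not the paper's actual tooling.

```python
def build_preference_pairs(sources, translate_a, translate_b, quality_score):
    """Create DPO-style (prompt, chosen, rejected) triples by letting a
    COMET-like scorer pick between two candidate translations."""
    pairs = []
    for src in sources:
        cand_a, cand_b = translate_a(src), translate_b(src)
        if cand_a == cand_b:
            continue  # identical outputs carry no preference signal
        better = quality_score(src, cand_a) >= quality_score(src, cand_b)
        chosen, rejected = (cand_a, cand_b) if better else (cand_b, cand_a)
        pairs.append({"prompt": src, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy usage with stand-in callables
score = lambda src, hyp: len(hyp)  # stand-in for a real quality metric
print(build_preference_pairs(["A cat sits on the mat."],
                             lambda s: "Maček sedi na preprogi.",
                             lambda s: "Mačka sedi na podlogi.",
                             score))
```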
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems and end-to-end (E2E) speech translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs [96.68469559192846]
We present two differently sized MoE large language models (LLMs). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters (a toy routing sketch after this entry illustrates this gap). We propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency.
arXiv Detail & Related papers (2025-03-07T04:43:39Z)
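The gap between total and activated parameters in such models comes from top-k expert routing: every expert exists in memory, but each token is processed by only a few. Below is a deliberately tiny, generic top-2 MoE layer in PyTorch to illustrate the mechanism; the dimensions and expert count are arbitrary, and this is not the Ling architecture.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: all experts contribute to the total
    parameter count, but each token activates only `top_k` of them."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```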
- Kanana: Compute-efficient Bilingual Language Models [9.597618914676106]
Kanana is a series of bilingual language models that demonstrate superior performance in Korean and competitive performance in English. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models. It also elaborates on approaches for adapting the models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling.
arXiv Detail & Related papers (2025-02-26T08:36:20Z)
- PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency [4.122864669557465]
We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency.
The model was trained from scratch using 2 trillion tokens.
Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks.
arXiv Detail & Related papers (2024-10-10T02:59:36Z)
- RoBLEURT Submission for the WMT2021 Metrics Task [72.26898579202076]
We present our submission to the Shared Metrics Task: RoBLEURT.
Our model reaches state-of-the-art correlations with the WMT 2020 human annotations on 8 out of 10 to-English language pairs.
arXiv Detail & Related papers (2022-04-28T08:49:40Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency (a utilization-based sketch follows this entry).
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
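Expert pruning of the kind mentioned above is often driven by routing statistics: experts that rarely receive tokens during a profiling run are dropped. A generic, utilization-based sketch follows; the counts and keep ratio are invented, and this is not the paper's exact method.

```python
import torch

def select_experts_to_keep(utilization, keep_ratio=0.5):
    """Given per-expert token counts from a profiling run, return the
    (sorted) indices of the most-used experts to retain after pruning."""
    n_keep = max(1, int(len(utilization) * keep_ratio))
    return torch.topk(utilization, n_keep).indices.sort().values

# Invented utilization counts for 8 experts
counts = torch.tensor([120., 3., 98., 45., 0., 77., 12., 60.])
print(select_experts_to_keep(counts))  # tensor([0, 2, 5, 7])
```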
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation (sketched after this entry), knowledge distillation, multi-feature reranking, and transductive fine-tuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
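Of the strategies listed above, back-translation is the simplest to sketch: monolingual target-language text is translated into the source language by a reverse-direction model, and the resulting synthetic pairs augment the parallel training data. The helper below is a generic illustration with a stand-in reverse model, not the authors' pipeline.

```python
def back_translate(target_sentences, reverse_translate):
    """Build synthetic parallel data: the reverse model produces a (noisy)
    source side, while the human-written target side serves as the label
    for training the forward translation model."""
    return [(reverse_translate(tgt), tgt) for tgt in target_sentences]

# Toy usage with a stand-in reverse model
fake_reverse = lambda s: f"<synthetic source for: {s}>"
for src, tgt in back_translate(["Guten Tag.", "Wie geht es dir?"], fake_reverse):
    print(src, "->", tgt)
```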