Solar Open Technical Report
- URL: http://arxiv.org/abs/2601.07022v1
- Date: Sun, 11 Jan 2026 18:33:09 GMT
- Title: Solar Open Technical Report
- Authors: Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh,
- Abstract summary: Solar Open demonstrates a systematic methodology for building competitive LLMs. We synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. We apply our proposed framework SnapPO for efficient optimization.
- Score: 65.93022715874504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.
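The abstract describes a progressive curriculum that jointly adjusts data composition and quality thresholds over the course of pretraining. The paper does not publish its schedule; the sketch below is a hypothetical illustration of the general idea, where domain mixture weights and a quality cutoff are linearly interpolated as training progresses. All domain names and numbers are invented for illustration.

```python
# Hypothetical sketch (not from the paper): a progressive data curriculum
# that interpolates domain mixture weights and raises a quality threshold
# as training advances. Domains and values are illustrative only.

def curriculum_stage(progress, start_mix, end_mix, q_start=0.3, q_end=0.8):
    """Interpolate domain sampling weights and a quality cutoff.

    progress: fraction of total training tokens consumed, in [0, 1].
    start_mix / end_mix: dicts mapping domain name -> sampling weight.
    Returns (normalized mixture weights, quality threshold).
    """
    assert 0.0 <= progress <= 1.0
    mix = {
        d: (1 - progress) * start_mix[d] + progress * end_mix[d]
        for d in start_mix
    }
    total = sum(mix.values())
    mix = {d: w / total for d, w in mix.items()}  # renormalize to sum to 1
    quality_threshold = (1 - progress) * q_start + progress * q_end
    return mix, quality_threshold

# Early training favors broad web text; later stages shift weight toward
# domain-specific and synthetic reasoning-oriented data.
start = {"web": 0.7, "code": 0.2, "synthetic": 0.1}
end = {"web": 0.3, "code": 0.3, "synthetic": 0.4}
mix, q = curriculum_stage(0.5, start, end)
```

Halfway through training (`progress=0.5`), this schedule samples web, code, and synthetic data at 0.5/0.25/0.25 with a quality cutoff of 0.55; the actual Solar Open curriculum is jointly optimized rather than fixed in advance.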
Related papers
- SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation [20.81567866070287]
We present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data.
arXiv Detail & Related papers (2025-10-01T07:01:33Z)
- Breaking Language Barriers: Equitable Performance in Multilingual Language Models [17.343456129678067]
LLMs perform worse in Common Sense Reasoning (CSR) tasks when prompted in low-resource languages (LRLs) like Hindi or Swahili than in high-resource languages (HRLs) like English. Our approach involves fine-tuning an LLM on synthetic code-switched text generated using controlled language-mixing methods. We present a new dataset of synthetic code-switched text derived from the CommonSenseQA dataset, featuring three distinct language ratio configurations.
arXiv Detail & Related papers (2025-08-18T06:50:24Z)
- SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work [87.9341538630949]
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition's aim is to evaluate architectures that translate spoken language sentences into a sequence of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z)
- Training Large Recommendation Models via Graph-Language Token Alignment [53.3142545812349]
We propose GLTA, a novel framework for training large recommendation models via graph-language token alignment. By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction.
arXiv Detail & Related papers (2025-02-26T02:19:10Z)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages, for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
- ConVerSum: A Contrastive Learning-based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents [4.029675201787349]
Cross-lingual summarization (CLS) is a sophisticated branch of Natural Language Processing. There is currently no feasible solution for CLS when no high-quality CLS data is available. We propose ConVerSum, a novel data-efficient approach for CLS that leverages the power of contrastive learning.
arXiv Detail & Related papers (2024-08-17T19:03:53Z)
- LEARN: Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application [54.984348122105516]
We propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge.
arXiv Detail & Related papers (2024-05-07T04:00:30Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus [41.031187560839555]
Cross-lingual semantic role labeling is one promising way to address the scarcity of annotated training data in many languages.
We propose a novel alternative based on corpus translation, constructing high-quality training datasets for the target languages.
Experimental results on Universal Proposition Bank show that the translation-based method is highly effective.
arXiv Detail & Related papers (2020-04-14T04:16:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.