Fugu-MT Paper Translation (Abstract): Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

Paper overview: Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

arxiv url: http://arxiv.org/abs/2508.13525v1
Date: Tue, 19 Aug 2025 05:33:48 GMT
Status: Translation complete
System update date: 2025-08-20 15:36:31.804737
Title: Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Title (reference translation): Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Authors: Hassan Barmandah
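Abstract summary: Large language models (LLMs) for Arabic are dominated by Modern Standard Arabic (MSA). This underrepresentation hinders their ability to capture authentic dialectal variation. Using a Saudi Dialect Instruction dataset, ALLaM-7B-Instruct-preview is LoRA-tuned for Saudi dialect generation.
Reference score (independently computed attention score): 0.0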
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
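Abstract (reference translation): Large language models (LLMs) for Arabic are dominated by Modern Standard Arabic (MSA), with only limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, is LoRA-tuned for Saudi dialect generation. Two variants are investigated: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model raises the Saudi rate from 47.97% to 84.21% and reduces MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding the metadata-tag echoing that these baselines frequently exhibit. The dataset and model weights/adapters are not released; instead, training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) are released to support independent verification.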
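
The difference between the two training variants described in the abstract comes down to how each instruction is formatted before fine-tuning. The following is a minimal sketch of that idea only, not the authors' code: the tag strings in `DIALECT_TAGS` and the function name `format_instruction` are hypothetical, since the abstract does not state the actual tag format or prompt template.

```python
from typing import Optional

# Hypothetical tag strings; the abstract does not specify the real tag format.
DIALECT_TAGS = {"hijazi": "<HIJAZI>", "najdi": "<NAJDI>"}


def format_instruction(instruction: str, dialect: Optional[str] = None) -> str:
    """Format one instruction for training or inference.

    With a dialect given, the tag is prepended (Dialect-Token variant).
    With dialect=None, the instruction is returned unchanged (No-Token variant).
    """
    if dialect is None:
        return instruction
    tag = DIALECT_TAGS[dialect.lower()]  # raises KeyError for unknown dialects
    return f"{tag} {instruction}"


# Example: the same instruction under both variants.
tagged = format_instruction("Describe your weekend plans.", "Najdi")
untagged = format_instruction("Describe your weekend plans.")
```

Keeping the tag logic in one formatting function means the same dataset can produce either variant, which matches the abstract's description of the tag being omitted "at formatting time".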

Related papers