Fugu-MT Paper Translation (Abstract): Omnilingual MT: Machine Translation for 1,600 Languages

Paper summary: Omnilingual MT: Machine Translation for 1,600 Languages

arxiv url: http://arxiv.org/abs/2603.16309v1
Date: Tue, 17 Mar 2026 09:43:42 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.204923
Title: Omnilingual MT: Machine Translation for 1,600 Languages
Authors: Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà
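Abstract summary: We present Omnilingual Machine Translation (OMT), the first machine translation system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets. OMT models improve cross-lingual transfer and come close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated.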
Reference score (independently calculated attention level): 58.66170104105936
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and are freely available.
