Fugu-MT Paper Translation (Summary): Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Paper summary: Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

arxiv url: http://arxiv.org/abs/2305.12182v2
Date: Fri, 26 May 2023 11:30:08 GMT
Status: Translation complete
Last updated in system: 2023-05-29 19:33:09.681872
Title: Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Title (reference translation): Glot500: Extending Multilingual Corpora and Language Models to 500 Languages
Authors: Ayyoob Imani and Peiqin Lin and Amir Hossein Kargaran and Silvia Severini and Masoud Jalili Sabet and Nora Kassner and Chunlan Ma and Helmut Schmid and André F. T. Martins and François Yvon and Hinrich Schütze
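Abstract summary: Glot500-m is a horizontally scaled Large Language Model (LLM) that covers 511 predominantly low-resource languages. An important part of this effort is collecting and cleaning Glot500-c, a corpus that covers these 511 languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline.
Reference score (attention score computed by this site): 8.298465385153527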
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.
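Abstract (reference translation): The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We create Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is collecting and cleaning Glot500-c, a corpus that covers these 511 languages and makes it possible to train Glot500-m. We evaluate Glot500-m on five tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and should instead strive to support as many languages as possible, bringing the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.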

Related papers

Goldfish: Monolingual Language Models for 350 Languages [23.365111479090626]
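For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Goldfish is a suite of monolingual autoregressive Transformer language models with up to 125M parameters, covering 350 languages.
Paper reference translation (metadata) (2024-08-19T22:31:21Z)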
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
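Large language models (LLMs) have shown remarkable proficiency on natural language processing tasks. LLMs often struggle to perform well in low-resource languages because less training data is available. This work explores training LLaMA-2 to speak Amharic, a language spoken by more than 50 million people worldwide.
Paper reference translation (metadata) (2024-03-11T01:04:36Z)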
Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions [49.97641297850361]
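LINGOLLM is a training-free approach that enables an LLM to process unseen languages that hardly occur in its pre-training. We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance. We show that it raises translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions.
Paper reference translation (metadata) (2024-02-28T03:44:01Z)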
MaLA-500: Massive Language Adaptation of Large Language Models [61.440556436524]
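MaLA-500 is a novel large language model designed to cover an extensive range of 534 languages. Our intrinsic evaluation shows that MaLA-500 is better at predicting text in low-resource languages than existing multilingual LLMs.
Paper reference translation (metadata) (2024-01-24T08:57:39Z)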
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
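This paper proposes TransliCo, a framework for fine-tuning multilingual pretrained language models (mPLMs). We show that Furina, the model obtained by fine-tuning Glot500-m with TransliCo, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
Paper reference translation (metadata) (2024-01-12T15:12:48Z)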
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
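We propose building multilingual word embeddings (MWEs) with a novel approach based on a chain of languages. Starting from a resource-rich source, we build the MWEs one language at a time, sequentially adding each language in the chain until we reach the target. We evaluate our method on bilingual lexicon induction for four language families, involving four low-resource (5M tokens) and four moderately low-resource (50M) target languages.
Paper reference translation (metadata) (2023-11-21T09:59:29Z)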
GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
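GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work.
Paper reference translation (metadata) (2023-10-24T23:45:57Z)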
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
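Large language models (LLMs) are known to perform tasks effectively by simply observing a few examples. We propose assembling synthetic examples from a diverse set of high-resource languages to prompt LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translation between English and 13 Indic languages and 21 African low-resource languages.
Paper reference translation (metadata) (2023-06-20T08:27:47Z)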

The related papers list is generated automatically from the titles and abstracts of papers on this site.

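This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use (including all information and translations).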