Fugu-MT Paper Translation (Abstract): CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

Paper overview: CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

arxiv url: http://arxiv.org/abs/2305.14214v2
Date: Mon, 23 Oct 2023 11:17:53 GMT
Status: Translation complete
Last updated in system: 2023-10-25 11:53:20.558730
Title: CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models
Title (reference translation): CompoundPiece: Evaluating and Improving the Decompounding Performance of Language Models
Authors: Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić
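Abstract summary: We systematically study decompounding, the task of splitting compound words into their constituents. We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We introduce a novel methodology to train dedicated models for decompounding.
Reference score (independently computed attention score): 77.45934004406283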
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While many languages possess processes of joining two or more words to create compound words, previous studies have been typically limited only to languages with excessively productive compound formation (e.g., German, Dutch) and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words which are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised learning stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved performance on decompounding over an otherwise equivalent model using SentencePiece tokenization.
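Abstract (reference translation): Many languages have processes that join two or more words to form compound words, but previous studies have generally been limited to languages with excessively productive compound formation (e.g., German, Dutch), and no public dataset covers compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. First, we address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. Next, we use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words that subword tokenization tokenizes unfavorably. We therefore propose a novel methodology for training dedicated decompounding models. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we call CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved decompounding performance over an otherwise equivalent model using SentencePiece tokenization.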
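To make the task concrete, here is a minimal sketch of decompounding as an input/output problem. It is not the paper's method: the paper trains neural models, while this toy uses a hand-written vocabulary and a dynamic-programming search for the split with the fewest parts. The function name `decompound`, the `min_len` parameter, and the example vocabulary are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def decompound(word: str, vocab: Set[str], min_len: int = 3) -> List[str]:
    """Split `word` into the fewest vocabulary entries.

    Returns [word] (lowercased) when no full split exists, which also
    covers non-compound words.
    """
    word = word.lower()
    n = len(word)
    # best[i] is the shortest list of constituents covering word[:i],
    # or None if word[:i] cannot be covered by vocabulary entries.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only consider pieces of at least `min_len` characters.
        for start in range(0, end - min_len + 1):
            prefix = best[start]
            piece = word[start:end]
            if prefix is not None and piece in vocab:
                candidate = prefix + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    result = best[n]
    return result if result else [word]


# Hypothetical toy vocabulary of German constituents (illustration only).
toy_vocab = {"wind", "schutz", "scheibe", "haus", "tür"}
compound_split = decompound("Windschutzscheibe", toy_vocab)
non_compound = decompound("Haus", toy_vocab)
unknown_word = decompound("Katze", toy_vocab)
```

A lookup-based splitter like this cannot handle linking elements (such as the German "s" between constituents) or words missing from its vocabulary. Trained decompounding models of the kind the abstract describes are intended to overcome limits of this sort.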

