Fugu-MT Paper Translation (Abstract): Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Paper overview: Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

arxiv url: http://arxiv.org/abs/2606.18717v1
Date: Wed, 17 Jun 2026 05:55:50 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.026535
Title: Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Authors: Tolga Şakar
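Abstract summary: This paper presents Morpheus, a neural morpheme-boundary model for Turkish. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships. Among reversible tokenizers, Morpheus attains the lowest bits-per-character (1.425) and roughly doubles the gold morphological alignment of the subword family.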
Reference score (attention score computed by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs. ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
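The abstract does not spell out the dynamic program, so the following is only a minimal sketch of one way such a computation could look, not the authors' implementation. It assumes that the morpheme index of a character equals the number of boundaries before it, which makes that index Poisson-binomial distributed given independent per-character boundary probabilities. The function names `soft_memberships` and `hard_segments`, the 0.5 threshold, and the example probabilities are all hypothetical.

```python
def soft_memberships(boundary_probs):
    """Return q where q[t][k] = P(character t belongs to morpheme k).

    Assumption (not from the paper): boundary_probs[i] is the probability
    of a morpheme boundary between character i and character i + 1, and
    boundaries are treated as independent Bernoulli variables.
    """
    n = len(boundary_probs) + 1          # number of characters
    q = [[0.0] * n for _ in range(n)]    # at most n morphemes in n characters
    q[0][0] = 1.0                        # first character opens morpheme 0
    for t in range(1, n):
        p = boundary_probs[t - 1]
        for k in range(n):
            stay = q[t - 1][k] * (1.0 - p)                  # no boundary: same morpheme
            move = q[t - 1][k - 1] * p if k > 0 else 0.0    # boundary: next morpheme
            q[t][k] = stay + move
    return q


def hard_segments(word, boundary_probs, threshold=0.5):
    """Cut the word wherever the boundary probability exceeds the threshold.

    Segments are plain slices of the input, with no normalization, so
    joining them always reproduces the original string.
    """
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


# Illustrative values only: "evlerde" = ev + ler + de ("in the houses").
word = "evlerde"
probs = [0.1, 0.9, 0.1, 0.1, 0.8, 0.1]

q = soft_memberships(probs)
print([round(sum(row), 6) for row in q])   # every row is a distribution over morphemes
print(hard_segments(word, probs))          # ['ev', 'ler', 'de']
print("".join(hard_segments(word, probs)) == word)
```

The recurrence uses only sums and products of the boundary probabilities, so the same computation stays differentiable when written in an autodiff framework. The hard path returns slices of the untouched input, which is why the round trip holds by construction in this sketch.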

Related papers