Fugu-MT Paper Translation (Abstract): LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

Paper overview: LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

arxiv url: http://arxiv.org/abs/2603.22629v1
Date: Mon, 23 Mar 2026 23:07:16 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.215507
Title: LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
Authors: Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl
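Abstract summary: This paper proposes the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing the embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings. LGSE is evaluated on three NLP tasks (Question Answering, Named Entity Recognition, and Text Classification) in two morphologically rich, low-resource languages.
Reference score (independently computed attention metric): 7.623227616015147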
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available. Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.
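The initialization and regularization steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `init_embedding`, `char_ngrams`, and `anchor_penalty` are hypothetical, the `segment` callable stands in for whatever morphological segmenter is available, the zero-vector last resort is an assumption, and the squared L2 form of the penalty is one plausible reading of "penalizes large deviations" rather than the paper's stated formula.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams with FastText-style boundary markers."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def init_embedding(
    token: str,
    segment: Callable[[str], Optional[List[str]]],  # hypothetical segmenter
    morpheme_vecs: Dict[str, Vector],
    ngram_vecs: Dict[str, Vector],
    dim: int,
) -> Vector:
    """Average morpheme vectors; fall back to character n-gram vectors."""
    morphemes = segment(token)
    if morphemes:
        known = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
        if known:
            return mean_vector(known)
    grams = [ngram_vecs[g] for g in char_ngrams(token) if g in ngram_vecs]
    if grams:
        return mean_vector(grams)
    # Assumption: zero vector when nothing is known about the token.
    return [0.0] * dim


def anchor_penalty(current: Vector, initial: Vector, lam: float = 0.1) -> float:
    """Assumed squared L2 penalty tying a new embedding to its initial value."""
    return lam * sum((c - i) ** 2 for c, i in zip(current, initial))
```

In such a setup, `anchor_penalty` would be summed over the newly added tokens and added to the language-modeling loss during adaptation, with `lam` controlling how far the new embeddings may drift from their initialized values.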
