Fugu-MT Paper Translation (Abstract): LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

Paper overview: LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

arxiv url: http://arxiv.org/abs/2512.07522v1
Date: Mon, 08 Dec 2025 12:59:24 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.894619
Title: LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Title (reference translation): LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Authors: Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth
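Abstract summary: LIME (Linguistic Metadata Embeddings) is a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution while introducing only 0.01% additional parameters at negligible compute overhead. In addition, we develop LIME+1, a variant with shifted metadata that can guide token generation.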
Reference score (attention level, computed independently by this site): 44.57551925823648
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
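Abstract (reference translation): Pre-training decoder-only language models relies on large amounts of high-quality data. Metadata is commonly used to create and curate these datasets, but its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings). LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to substantially stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop LIME+1, a variant with shifted metadata that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.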
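
The abstract describes enriching token embeddings with metadata and a shifted variant, LIME+1, but does not spell out the mechanics. The following is a minimal sketch of one way this could look, not the authors' implementation. It assumes each token carries a single integer metadata tag (for example a part-of-speech id), that the tag embedding is added element-wise to the token embedding, and that the shifted variant pairs position i with the tag of token i+1. The names `make_table`, `embed`, `shift`, and the use of tag id 0 as padding are all hypothetical.

```python
import random


def make_table(num_rows, dim, rng):
    """Create a randomly initialised embedding table (one vector per row)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(token_ids, tag_ids, token_table, tag_table, shift=0):
    """Return one input vector per position: token embedding + tag embedding.

    shift=0 pairs each token with its own tag (assumed LIME-style setting).
    shift=1 pairs position i with the tag of token i+1 (assumed LIME+1-style
    setting); positions with no following token fall back to tag id 0,
    a hypothetical "unknown" tag.
    """
    if len(token_ids) != len(tag_ids):
        raise ValueError("token_ids and tag_ids must have equal length")
    # Move tags left by `shift` positions and pad the tail with tag id 0.
    shifted = list(tag_ids[shift:]) + [0] * min(shift, len(tag_ids))
    return [
        [t + m for t, m in zip(token_table[tok], tag_table[tag])]
        for tok, tag in zip(token_ids, shifted)
    ]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
VOCAB, NUM_TAGS, DIM = 10, 4, 8
token_table = make_table(VOCAB, DIM, rng)
tag_table = make_table(NUM_TAGS, DIM, rng)

token_ids = [3, 7, 1]
tag_ids = [1, 2, 3]

x = embed(token_ids, tag_ids, token_table, tag_table)                 # own tag
x_shift = embed(token_ids, tag_ids, token_table, tag_table, shift=1)  # next tag

# Parameters added by the tag table, relative to the token table alone.
extra_fraction = (NUM_TAGS * DIM) / (VOCAB * DIM)
```

In this sketch the only new parameters are the rows of `tag_table`, so the overhead shrinks as the tag inventory becomes small relative to the rest of the model. The toy ratio computed here is not meant to match the 0.01% figure reported in the abstract.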

Related papers