Fugu-MT Paper Translation (Abstract): Baby Scale: Investigating Models Trained on Individual Children's Language Input

Paper overview: Baby Scale: Investigating Models Trained on Individual Children's Language Input

arxiv url: http://arxiv.org/abs/2603.29522v1
Date: Tue, 31 Mar 2026 10:06:24 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.480757
Title: Baby Scale: Investigating Models Trained on Individual Children's Language Input
Authors: Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
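Abstract summary: Modern language models must be trained on orders of magnitude more training data than human children receive. The authors benchmark LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. LMs trained on child data show acceptable scaling on grammar tasks, but weaker scaling on semantic and world knowledge tasks than models trained on synthetic data.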
Reference score (attention score computed independently by the site): 2.3226022042424934
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.
