Fugu-MT Paper Translation (Abstract): Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

Paper summary: Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

arxiv url: http://arxiv.org/abs/2510.25187v1
Date: Wed, 29 Oct 2025 05:38:06 GMT
Status: Translation complete
Last updated in system: 2025-10-30 15:50:45.098368
Title: Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction
Title (reference translation): Verifying Cross-Lingual Text Comprehension in LLMs Using Next Sentence Prediction
Authors: Ritesh Sunil Chavan, Jack Mostow
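Abstract summary: We created a benchmark with 10,000 questions each for English, Swahili, and Hausa. We tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B. All models excelled in English, but accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most.
Reference score (independently computed attention score): 2.191505742658975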
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work by Agarwal et al. (2025), which used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and identifies factors influencing the models' decisions.
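Abstract (reference translation): Large language models are trained on massive datasets, but this data is heavily skewed toward English. Does their impressive performance reflect genuine ability, or merely this data advantage? To investigate, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on Agarwal et al. (2025), who used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark of 10,000 questions each for English (high-resource), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results made clear how the level of language resources affects outcomes. All models excelled in English, but accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. Introducing Chain-of-Thought (CoT) prompting made the story even more interesting. For the struggling LLaMA 3, CoT served as a useful guide and significantly improved accuracy. For the more capable GPT-4 and Gemini, however, the same technique often backfired, leading to a kind of "overthinking" that hurt results in the cross-lingual context. This shows that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses and highlights when CoT helps or hinders cross-lingual NSP performance, as well as the factors that influence model decisions.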

