Fugu-MT Paper Translation (Abstract): Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

Paper summary: Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

arxiv url: http://arxiv.org/abs/2604.06201v1
Date: Fri, 13 Mar 2026 19:26:08 GMT
Status: Translation complete
Last updated in system: 2026-04-12 18:41:08.626591
Title: Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models
Authors: Pei-Fu Guo, Ya-An Tsai, Chun-Chia Hsu, Kai-Xin Chen, Yun-Da Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin
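Abstract summary: This work introduces Text2DistBench, a reading comprehension benchmark for evaluating LLMs' ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and the associated comments. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities.
Reference score (independently computed attention level): 67.09110757873142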
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs' ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.
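The two question types named in the abstract (estimating sentiment proportions, and identifying the most and second most frequent topics) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the labels, the sample data, the helper names, and the use of total variation distance as an error measure are hypothetical, and the abstract does not state which metrics Text2DistBench actually uses.

```python
from collections import Counter


def label_proportions(labels):
    """Return the share of each label in a list of per-comment labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts.

    Assumed error measure for this sketch only; not taken from the paper.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def top_two(labels):
    """Return the most and second most frequent labels."""
    return [label for label, _ in Counter(labels).most_common(2)]


# Hypothetical per-comment annotations for one entity (not benchmark data).
sentiments = ["positive", "positive", "negative", "positive"]
topics = ["acting", "plot", "acting", "music", "plot", "acting"]

# Reference distribution derived from the annotated comments.
reference = label_proportions(sentiments)

# Hypothetical answer from a model asked to estimate the proportions.
model_estimate = {"positive": 0.6, "negative": 0.4}

error = total_variation(reference, model_estimate)
top_topics = top_two(topics)

print(reference, error, top_topics)
```

A lower distance means the estimated proportions are closer to the reference. The ranking question can be scored by comparing `top_topics` with the model's ordered answer.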
