Fugu-MT Paper Translation (Summary): Context-Length Robustness in Question Answering Models: A Comparative Empirical Study

Paper summary: Context-Length Robustness in Question Answering Models: A Comparative Empirical Study

arxiv url: http://arxiv.org/abs/2603.15723v1
Date: Mon, 16 Mar 2026 17:14:05 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:06.911406
Title: Context-Length Robustness in Question Answering Models: A Comparative Empirical Study
Authors: Trishita Dhara, Siddhesh Sheth
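Abstract summary: This paper presents an empirical study of context-length robustness in large language models using two benchmarks, SQuAD and HotpotQA. Model accuracy is evaluated as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. The results show a consistent degradation in performance as context length increases, with substantially larger drops on multi-hop reasoning tasks than on single-span extraction tasks.
Reference score (independently computed attention score): 0.0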
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions. These findings highlight task-dependent differences in robustness and suggest that multi-hop reasoning is especially vulnerable to context dilution. We argue that context-length robustness should be evaluated explicitly when assessing model reliability, especially for applications involving long documents or retrieval-augmented generation.
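The context-expansion setup described in the abstract (adding irrelevant text while keeping the answer-bearing passage intact) can be illustrated with a minimal sketch. This is not the authors' code: the function name `build_padded_context`, the passage-level padding, the random placement of the gold passage, and the example strings are all assumptions made for illustration.

```python
import random


def build_padded_context(gold_passage, distractors, n_distractors, seed=0):
    """Return one context string: the gold passage plus n irrelevant passages.

    Hypothetical helper, not taken from the paper. The gold passage is kept
    verbatim so the answer-bearing signal is preserved while length grows.
    """
    rng = random.Random(seed)
    # Draw the irrelevant passages without replacement.
    chosen = rng.sample(distractors, n_distractors)
    # Place the gold passage at a random slot (an assumption; the paper's
    # placement policy is not stated in the abstract).
    position = rng.randint(0, len(chosen))
    passages = chosen[:position] + [gold_passage] + chosen[position:]
    return "\n\n".join(passages)


if __name__ == "__main__":
    gold = "Paris is the capital of France."
    distractors = [f"Unrelated passage number {i}." for i in range(10)]
    for n in (0, 2, 5, 10):
        context = build_padded_context(gold, distractors, n)
        # The gold passage is always present; only the padding changes.
        print(n, len(context), gold in context)
```

Evaluating the same question against each padded context, and comparing accuracy across values of `n_distractors`, would give an accuracy-versus-length curve of the kind the abstract describes.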


Related papers