Fugu-MT Paper Translation (Abstract): CRAwLeR -- Cross-Reference Aware Legal Retrieval

Paper overview: CRAwLeR -- Cross-Reference Aware Legal Retrieval

arxiv url: http://arxiv.org/abs/2606.21676v1
Date: Fri, 19 Jun 2026 18:32:44 GMT
Status: Translation complete
Last updated in system: 2026-06-26 04:04:40.864731
Title: CRAwLeR -- Cross-Reference Aware Legal Retrieval
Title (reference translation): CRAwLeR -- Cross-Reference-Aware Legal Retrieval
Authors: Maciej Jalocha, William Michelsen
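Abstract summary: Existing benchmarks for context-aware chunk retrieval rely heavily on repurposed task items. We focus on a specific kind of context dependence, legal cross-references, and introduce CRAwLeR. Our pipeline detects legal cross-references, identifies query candidates, links target chunks to their relevant context, and generates context-demanding queries.
Reference score (independently computed attention score): 0.0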
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing benchmarks for context-aware chunk retrieval rely heavily on repurposed task items and rarely demonstrate that their queries genuinely require context, making score interpretation difficult. We focus on a specific kind of context dependence, legal cross-references, and introduce CRAwLeR, an operationalization of a narrow, well-defined phenomenon: cross-reference-aware context utilization for chunk retrieval in legal documents. Our pipeline detects legal cross-references, identifies query candidates, links target chunks to their relevant context, generates context-demanding queries with an LLM, and filters them through both an adversarial non-contextual baseline and an assurance prompt. We release CRAwLeR-DK and CRAwLeR-PL, Danish and Polish datasets built with this pipeline, alongside a strong Anthropic-style contextualization baseline. Manual analysis finds that approximately 80% of randomly sampled queries genuinely target the labelled target chunk and require context, with failures following systematic and named patterns. The benchmarks are hard but not solved: best Recall@10 reaches 55% on CRAwLeR-DK and 59% on CRAwLeR-PL. Ablation and failure analysis attribute the remaining gap to the contextualising LLM, not the retriever. Even when the target is retrieved in the top ten, labelled context chunks routinely outrank it. Ours is the first dataset for context-aware chunk retrieval to carefully consider construct validity and to inspect its results in the light of such a narrow, well-defined phenomenon.
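Abstract (reference translation): Existing benchmarks for context-aware chunk retrieval rely heavily on repurposed task items and rarely show that their queries actually require context, which makes scores difficult to interpret. We focus on a specific kind of context dependence, legal cross-references, and introduce CRAwLeR, an operationalization of a narrow, well-defined phenomenon. Our pipeline detects legal cross-references, identifies query candidates, links target chunks to their relevant context, generates context-demanding queries with an LLM, and filters them through both an adversarial non-contextual baseline and an assurance prompt. We release CRAwLeR-DK and CRAwLeR-PL, Danish and Polish datasets built with this pipeline, together with a strong Anthropic-style contextualization baseline. In manual analysis, approximately 80% of randomly sampled queries genuinely target the labelled target chunk and require context. Best Recall@10 reaches 55% on CRAwLeR-DK and 59% on CRAwLeR-PL. Ablation and failure analysis attribute the remaining gap to the contextualizing LLM rather than the retriever. Even when the target is retrieved in the top ten, labelled context chunks routinely outrank it. Ours is the first dataset for context-aware chunk retrieval to carefully consider construct validity and to inspect its results in light of such a narrow, well-defined phenomenon.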
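
To make the Recall@10 figures in the abstract concrete, here is a minimal sketch of how such a metric could be computed, along with a check for the case the abstract describes where a labelled context chunk ranks above the target. It is not the authors' evaluation code. The function names, the one-target-per-query assumption, and the chunk identifiers are all hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List


def recall_at_k(rankings: Dict[str, List[str]],
                targets: Dict[str, str],
                k: int = 10) -> float:
    """Fraction of queries whose labelled target chunk is in the top k results.

    Assumes exactly one labelled target chunk per query (an assumption of
    this sketch, not a statement about the released datasets).
    """
    if not rankings:
        raise ValueError("rankings must contain at least one query")
    hits = sum(1 for qid, ranked in rankings.items()
               if targets[qid] in ranked[:k])
    return hits / len(rankings)


def context_outranks_target(ranked: List[str],
                            target: str,
                            context_ids: Iterable[str],
                            k: int = 10) -> bool:
    """True if the target is in the top k but a labelled context chunk ranks above it."""
    top = ranked[:k]
    if target not in top:
        return False
    above_target = set(top[:top.index(target)])
    return any(c in above_target for c in context_ids)


if __name__ == "__main__":
    # Toy data with hypothetical chunk ids.
    rankings = {"q1": ["c7", "c2", "c9"], "q2": ["c4", "c5", "c1"]}
    targets = {"q1": "c2", "q2": "c1"}
    print(recall_at_k(rankings, targets, k=2))
    print(context_outranks_target(rankings["q1"], "c2", ["c7"]))
```

Reporting the second check next to recall separates two different outcomes: the target is missed entirely, or it is retrieved but ranked below its own context.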

Related papers