Fugu-MT paper translation (abstract): Offline RL for Adaptive Policy Retrieval in Prior Authorization

Paper overview: Offline RL for Adaptive Policy Retrieval in Prior Authorization

arxiv url: http://arxiv.org/abs/2604.05125v1
Date: Mon, 06 Apr 2026 19:40:26 GMT
Status: translation complete
Last updated in system: 2026-04-08 17:42:09.465443
Title: Offline RL for Adaptive Policy Retrieval in Prior Authorization
Authors: Ruslan Sharifullin, Maxim Gorshkov, Hannah Clay
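Abstract summary: Policies are trained using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO). CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval. IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies.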
Reference score (independently computed attention level): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL's 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a "selective-accurate" region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs $λ\in \{0.05, 0.1, 0.2\}$ reveals a clear accuracy-efficiency inflection: only at $λ= 0.2$ does CQL transition from exhaustive to selective retrieval.
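The retrieval MDP described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `RetrievalEpisode`, the `STOP` sentinel, the terminal rewards of +1 and -1, and the candidate-set size of 20 are all assumptions made for the example. The abstract states only that the reward trades decision correctness against a per-step retrieval cost $λ$.

```python
from dataclasses import dataclass, field

STOP = -1  # hypothetical sentinel action: stop retrieving and issue a decision


@dataclass
class RetrievalEpisode:
    """Toy episode: pick chunks from a top-K candidate set, then stop."""
    num_candidates: int          # K, size of the candidate set
    step_cost: float             # lambda, charged for each retrieved chunk
    correct_reward: float = 1.0  # assumed terminal reward for a correct decision
    wrong_reward: float = -1.0   # assumed terminal reward for a wrong decision
    retrieved: list = field(default_factory=list)
    done: bool = False

    def step(self, action: int, decision_correct: bool = False) -> float:
        """Apply one action and return the reward it earns."""
        if self.done:
            raise RuntimeError("episode already finished")
        if action == STOP:
            # Stopping ends the episode and pays the terminal reward.
            self.done = True
            return self.correct_reward if decision_correct else self.wrong_reward
        if not 0 <= action < self.num_candidates:
            raise ValueError("action must index a candidate chunk")
        if action in self.retrieved:
            raise ValueError("chunk already retrieved")
        # Retrieving one more chunk costs lambda.
        self.retrieved.append(action)
        return -self.step_cost


def episodic_return(num_steps, step_cost, decision_correct, num_candidates=20):
    """Retrieve the first `num_steps` candidates, then stop and decide."""
    episode = RetrievalEpisode(num_candidates=num_candidates, step_cost=step_cost)
    total = sum(episode.step(a) for a in range(num_steps))
    total += episode.step(STOP, decision_correct=decision_correct)
    return total


if __name__ == "__main__":
    # Exhaustive vs. selective retrieval, both ending in a correct decision.
    print(round(episodic_return(20, 0.1, True), 6))
    print(round(episodic_return(5, 0.1, True), 6))
```

Under these assumed rewards, retrieving all 20 candidates at $λ = 0.1$ costs 2.0 in step penalties, so even a correct decision ends with a negative return, whereas stopping after 5 chunks with a correct decision ends with a positive one. The actual reward values used in the paper may differ.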


Related papers