Fugu-MT Paper Translation (Abstract): SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

Paper overview: SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

arxiv url: http://arxiv.org/abs/2606.19646v1
Date: Wed, 17 Jun 2026 23:00:17 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.574781
Title: SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering
Authors: Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu
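Abstract summary: We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR. It then obtains a provisional answer from a text-only language model and uses a learned router to decide whether to accept the text answer or escalate to a VLM.
Reference score (attention score computed independently by this site): 16.639536340934715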
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
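The routing flow described in the abstract (OCR, then a provisional text-only answer, then a router decision, then optional escalation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `run_cascade`, `CascadeResult`, and `vlm_invocation_rate` are hypothetical, the four components are passed in as plain callables rather than real Azure, OpenAI, or Gemini clients, and the sketch assumes the router returns the probability that escalation is needed, with a default threshold of 0.5 chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CascadeResult:
    """Fields mirror what the demo is said to display side by side."""
    final_answer: str
    text_answer: str
    escalation_probability: float
    escalated: bool


def run_cascade(
    chart_image: bytes,
    question: str,
    ocr: Callable[[bytes], str],                # hypothetical OCR wrapper
    text_model: Callable[[str, str], str],      # hypothetical text-only LM wrapper
    router: Callable[[str, str, str], float],   # assumed: P(escalation needed)
    vlm: Callable[[bytes, str], str],           # hypothetical VLM wrapper
    threshold: float = 0.5,                     # illustrative default, not from the paper
) -> CascadeResult:
    # Step 1: extract chart text with OCR.
    ocr_text = ocr(chart_image)
    # Step 2: get a provisional answer from the text-only model.
    text_answer = text_model(ocr_text, question)
    # Step 3: the router scores the query using inference-time information.
    probability = router(ocr_text, question, text_answer)
    # Step 4: escalate to the VLM only when the score reaches the threshold.
    escalated = probability >= threshold
    final_answer = vlm(chart_image, question) if escalated else text_answer
    return CascadeResult(final_answer, text_answer, probability, escalated)


def vlm_invocation_rate(probabilities: Sequence[float], threshold: float) -> float:
    """Fraction of queries that would be escalated at a given threshold."""
    if not probabilities:
        return 0.0
    return sum(p >= threshold for p in probabilities) / len(probabilities)
```

Under these assumptions, a threshold of 0 escalates every query, which corresponds to the full-VLM baseline, and raising the threshold trades VLM calls for reliance on the text-only answer. Sweeping the threshold over held-out router scores with `vlm_invocation_rate` is one way to trace the accuracy-cost frontier that the demo lets users explore.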

Related papers