Fugu-MT paper translation (abstract): Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

Paper overview: Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

arxiv url: http://arxiv.org/abs/2606.14723v1
Date: Sun, 31 May 2026 05:56:27 GMT
Status: translation complete
Last updated in system: 2026-06-21 20:00:42.743779
Title: Disagreement-Based Cross-Model Routing for Implicit Video Question Answering
Authors: Durga Sandeep Saluru
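Abstract summary: We study multiple-choice video question answering on the ImplicitQA benchmark. On this benchmark, a single frontier video LLM already operates near its accuracy ceiling. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training.
Reference score (attention score computed by this site): 0.0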
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
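The routing rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the names `route_by_disagreement`, `majority_vote`, and `ask_secondary` are hypothetical, the secondary model's answer is assumed to replace the primary answer on routed questions, and the toy questions are invented for illustration (they are not results from the paper). No real model API is called; the secondary model is stood in for by a callable.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def route_by_disagreement(
    primary_samples: Sequence[str],
    ask_secondary: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (final_answer, was_routed) for one multiple-choice question.

    If all primary samples agree, keep their answer. Otherwise defer to the
    secondary model (assumed policy for the routed subset).
    """
    if not primary_samples:
        raise ValueError("need at least one primary sample")
    if len(set(primary_samples)) == 1:
        return primary_samples[0], False
    # The secondary model is only queried when the samples disagree.
    return ask_secondary(), True


def majority_vote(samples: Sequence[str]) -> str:
    """Self-consistency baseline: most common answer (ties go to first seen)."""
    return Counter(samples).most_common(1)[0][0]


# Toy data, invented for illustration only.
questions = [
    {"primary": ["A", "A", "A"], "secondary": "B", "gold": "A"},
    {"primary": ["C", "C", "D"], "secondary": "D", "gold": "D"},
    {"primary": ["A", "B", "C"], "secondary": "B", "gold": "B"},
    {"primary": ["D", "D", "D"], "secondary": "D", "gold": "C"},
]

routed_correct = 0
majority_correct = 0
routed_count = 0
for q in questions:
    answer, routed = route_by_disagreement(
        q["primary"], lambda q=q: q["secondary"]
    )
    routed_correct += answer == q["gold"]
    majority_correct += majority_vote(q["primary"]) == q["gold"]
    routed_count += routed

routed_accuracy = routed_correct / len(questions)
majority_accuracy = majority_correct / len(questions)
routed_fraction = routed_count / len(questions)

print(f"routed fraction:   {routed_fraction:.2f}")
print(f"routing accuracy:  {routed_accuracy:.2f}")
print(f"majority accuracy: {majority_accuracy:.2f}")
```

The last toy question shows the limit of the approach: when all three primary samples agree on a wrong answer, the question is never routed, so the gain can only come from the subset where the samples disagree.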

Related papers

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses [0.0]
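FailureScope is a behavioral diagnosis method that clusters evaluation probes by their cross-model pass/fail patterns. We show that it surfaces stable, interpretable failures across three regimes: single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks.
Paper reference translation (metadata) (2026-06-03T01:28:00Z)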
Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution [0.0]
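Prediction markets aggregate collective intelligence to forecast uncertain events. Existing oracle systems trade off fast but unreliable automation against accurate but costly human arbitration. We evaluate whether multi-agent LLM architectures can improve oracle resolution over single-model baselines.
Paper reference translation (metadata) (2026-05-29T03:44:19Z)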
Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses [6.004875368104112]
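Large language models (LLMs) have been proposed as a remedy, but rigorous evaluation across the complete survey workflow remains scarce. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants.
Paper reference translation (metadata) (2026-05-19T00:58:36Z)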
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
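We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. We find that step-level recombination is most beneficial on hard problems. The training-free framework improves average accuracy across six math and coding tasks by up to 2x.
Paper reference translation (metadata) (2026-02-26T11:08:39Z)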
Sharp Convergence Rates for Masked Diffusion Models [53.117058231393834]
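We develop a total-variation-based analysis of the Euler method that overcomes these limitations. The results relax the assumptions on score estimation, improve parameter dependence, and establish convergence guarantees. Overall, we introduce a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS.
Paper reference translation (metadata) (2026-02-26T00:47:51Z)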
ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
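We present ACAR, a measurement framework for multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma), computed from N=3 probe samples, to route tasks across single-model, two-model, and three-model execution modes. We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
Paper reference translation (metadata) (2026-02-06T23:27:17Z)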
LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning [73.90466023069125]
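We propose LOVE-R1, a model that can adaptively zoom in on video clips. The model is first given densely sampled frames at a small resolution. When spatial detail is needed, it can zoom in on the clips of interest at a larger frame resolution.
Paper reference translation (metadata) (2025-09-29T13:43:55Z)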
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.34900892130929]
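We explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, coverage scales with the number of samples over four orders of magnitude. In domains such as coding and formal proofs, where answers can be verified automatically, increased coverage translates directly into improved performance.
Paper reference translation (metadata) (2024-07-31T17:57:25Z)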

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

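This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use.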