Fugu-MT Paper Translation (Abstract): QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Paper overview: QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

arxiv url: http://arxiv.org/abs/2604.25884v1
Date: Tue, 28 Apr 2026 17:28:33 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.974768
Title: QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Authors: Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore
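Abstract summary: Introduces QCalEval, the first VLM benchmark for quantum calibration plots. It covers 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning.
Reference score (attention metric computed independently by this site): 37.18078731710843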
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

Related papers

MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models [0.0]
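MIRROR is a benchmark that evaluates whether large language models can use self-knowledge to make better decisions. The authors evaluate 16 models from 8 labs on roughly 250,000 evaluation instances.
Reference translation (metadata) (2026-04-15T08:41:12Z)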
VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images [0.0]
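This paper proposes VB, a benchmark that tests whether vision-language models can determine what is not visible in a photograph. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit. Models are evaluated on confidence-aware accuracy (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning.
Reference translation (metadata) (2026-03-03T23:03:11Z)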
UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment [17.430091762374744]
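For subjective perception tasks, the authors show that this alignment can be achieved without model training. They propose a training-free, post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages.
Reference translation (metadata) (2026-02-23T02:24:55Z)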
Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets [0.0]
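A well-calibrated model should express confidence that matches its actual accuracy: when it claims 80% confidence, it should be correct 80% of the time. The authors introduce KalshiBench, a benchmark of 300 prediction-market questions from Kalshi, a CFTC-regulated exchange. They evaluate five frontier models (Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2) and find overconfidence across all of them.
Reference translation (metadata) (2025-12-17T23:23:06Z)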
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [37.699828966838986]
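BridgeVLA is a novel 3D VLA model that projects 3D inputs into multiple 2D images, ensuring input alignment with the VLM backbone. It uses 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. It achieves a 96.8% success rate on more than 10 tasks with only three trajectories per task, demonstrating exceptional sample efficiency.
Reference translation (metadata) (2025-06-09T17:36:34Z)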
GWQ: Gradient-Aware Weight Quantization for Large Language Models [56.22507677736051]
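Large language models (LLMs) show excellent performance on complex language tasks. Compressing LLMs to low bit-widths allows them to be deployed on resource-constrained devices. The authors propose gradient-aware weight quantization (GWQ), which they describe as the first quantization approach of its kind for low-bit weight quantization.
Reference translation (metadata) (2024-10-30T11:16:04Z)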
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
MM-Vet v2 adds a new capability called "image-text sequence understanding". Benchmarking large multimodal models with MM-Vet v2 shows that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly ahead of GPT-4o at 71.0.
Reference translation (metadata) (2024-08-01T17:59:54Z)

The related papers list is generated automatically from the titles and abstracts of papers on this site.
