Fugu-MT Paper Translation (Abstract): DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

Paper overview: DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

arxiv url: http://arxiv.org/abs/2606.17574v1
Date: Tue, 16 Jun 2026 06:22:09 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.308453
Title: DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack
Title (reference translation): DeepInsight: A Unified Evaluation Infrastructure Across the Entire Physical AI Stack
Authors: Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen
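Abstract summary: Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude. No existing framework covers this range, so the stack is evaluated by stitching together separate harnesses. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime.
Reference score (independently computed attention score): 17.770542038652568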
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.
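Abstract (reference translation): Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude, from a single foundation-model decoding step to thousands of physics ticks of whole-body control. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, which preserves each segment's local validity but loses the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. It rests on three abstractions, each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist, at the foundation-model end, it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace.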
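
Illustrative sketch (not from the paper): the abstract names three invariants, one episode driver, one resource-handle protocol, and one trace identity scheme, but gives no API. The Python below is only a guess at how such a design could look. Every name in it (`TraceId`, `Trace`, `ResourceHandle`, `EchoBackend`, `run_episode`) is hypothetical and does not come from DeepInsight.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass(frozen=True)
class TraceId:
    """Hypothetical identity attached to every event: run, episode, step."""
    run: str
    episode: int
    step: int


@dataclass
class Trace:
    """One shared, append-only event log that every layer writes into."""
    events: list = field(default_factory=list)

    def write(self, tid: TraceId, layer: str, payload: Any) -> None:
        self.events.append((tid, layer, payload))


class ResourceHandle(Protocol):
    """Hypothetical protocol that every expensive backend would implement."""
    layer: str

    def acquire(self) -> None: ...
    def call(self, observation: Any) -> Any: ...
    def release(self) -> None: ...


class EchoBackend:
    """Stand-in backend; a real one might wrap LLM inference or a sandbox."""

    def __init__(self, layer: str, fn: Callable[[Any], Any]) -> None:
        self.layer = layer
        self._fn = fn
        self.live = False

    def acquire(self) -> None:
        self.live = True

    def call(self, observation: Any) -> Any:
        return self._fn(observation)

    def release(self) -> None:
        self.live = False


def run_episode(run: str, episode: int, steps: int,
                handle: ResourceHandle, trace: Trace, observation: Any) -> Any:
    """Single driver: the same loop serves a 1-step and a 1000-step task."""
    handle.acquire()
    try:
        for step in range(steps):
            observation = handle.call(observation)
            # Every event is written under the same identity scheme.
            trace.write(TraceId(run, episode, step), handle.layer, observation)
    finally:
        handle.release()  # released even if the backend raises
    return observation
```

In this sketch a one-step "planner" backend and a many-step "controller" backend can run through the same `run_episode` call and write into the same `Trace`, so events from both layers can be filtered by run and episode when looking for where a regression first appears.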
