Fugu-MT Paper Translation (Abstract): Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Paper overview: Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

arxiv url: http://arxiv.org/abs/2603.16086v1
Date: Tue, 17 Mar 2026 03:22:30 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.084079
Title: Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Title (reference translation): Toward a Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Authors: Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang
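Abstract summary: This paper formalizes Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception. It introduces HEAR, a VSLA framework comprising (i) a streaming Historizer that maintains a compact, causal audio context across execution gaps, (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs, (iii) an Advancer, formulated as an audio world model, that learns temporal dynamics by predicting near-future audio codes, and (iv) a flow-matching Realizer policy that generates smooth action chunks.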
Reference score (independently computed attention score): 26.766367856312694
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
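Abstract (reference translation): Recent Vision-Language-Action (VLA) models have begun to incorporate audio, but they typically treat sound as a static pre-execution prompt or focus only on human speech. This leaves a large gap in real-time, sound-centric manipulation, where fleeting environmental acoustics provide critical state verification during task execution. As a result, key sounds are easily missed because of low-frequency updates or system latency. Action chunking with open-loop execution makes the problem worse: it creates a Blind Execution Interval in which acoustic events are lost between discrete audio observation windows. Recognizing the need for continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework that integrates four components: (i) a streaming Historizer that maintains a compact, causal audio context across execution gaps; (ii) an Envisioner, adapted from omni foundation models, that reasons over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, that learns temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy that generates smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation requires causal persistence and explicit temporal learning. The framework is a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.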
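The Blind Execution Interval described in the abstract can be illustrated with a toy timeline. The sketch below is not the HEAR implementation; the function names, the integer time steps, and the parameters `chunk_len` and `window_len` are all assumptions made for illustration. It compares a policy that only hears a short audio window right before each decision with one that keeps a causal history covering the whole previously executed chunk.

```python
# Toy model (hypothetical, not from the paper): time is a sequence of integer
# control steps, and the policy makes a new decision every `chunk_len` steps.

def windowed_heard(event_t, chunk_len, window_len, horizon):
    """True if some decision step sees the event inside its short audio window.

    The window covers only the `window_len` steps right before a decision.
    """
    for decision_t in range(0, horizon + 1, chunk_len):
        if decision_t - window_len <= event_t < decision_t:
            return True
    return False


def streaming_heard(event_t, chunk_len, horizon):
    """True if a causal history spanning the whole previous chunk holds the event."""
    for decision_t in range(chunk_len, horizon + 1, chunk_len):
        if decision_t - chunk_len <= event_t < decision_t:
            return True
    return False


def blind_steps(chunk_len, window_len, horizon):
    """Steps at which an acoustic event would never reach the windowed policy."""
    return [
        t for t in range(horizon)
        if not windowed_heard(t, chunk_len, window_len, horizon)
    ]


if __name__ == "__main__":
    # Chunks of 8 steps with a 2-step audio window: most of each chunk is blind.
    print("blind steps (windowed):", blind_steps(chunk_len=8, window_len=2, horizon=16))
    print("all steps covered (streaming):",
          all(streaming_heard(t, 8, 16) for t in range(16)))
```

In this toy setting the blind steps vanish only when the window is as long as the chunk, which is the intuition behind keeping a persistent, causal audio context across execution gaps.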

Related papers