Fugu-MT Paper Translation (Summary): E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Paper summary: E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

arxiv url: http://arxiv.org/abs/2604.04834v1
Date: Mon, 06 Apr 2026 16:35:57 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:19.285823
Title: E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Title (reference translation): E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Authors: Jiajun Zhai, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang
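Abstract summary: E-VLA is an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset. E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models.
Reference score (independently computed attention score): 38.08824111103771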
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
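Abstract (reference translation): Robotic Vision-Language-Action (VLA) models generalize well to open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. E-VLA is an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly uses the motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset spanning diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, which overlays accumulated event maps onto RGB images, can substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success rises from 0% (image-only) to 60% with overlay fusion and to 90% with the event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25% and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.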
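
The parameter-free overlay fusion mentioned in the abstract (accumulating events over a time window and drawing the result onto the RGB frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the event tuple layout, the window bounds, the red/blue colour coding, and the function names `accumulate_events` and `overlay_events` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed layouts (not taken from the paper):
Event = Tuple[int, int, float, int]  # (x, y, timestamp in seconds, polarity +1 or -1)
Pixel = Tuple[int, int, int]         # (R, G, B), each in 0..255
Image = List[List[Pixel]]            # indexed as image[y][x]

# Hypothetical colour coding: red for net positive, blue for net negative events.
POS_COLOR: Pixel = (255, 0, 0)
NEG_COLOR: Pixel = (0, 0, 255)


def accumulate_events(events: List[Event], width: int, height: int,
                      t_start: float, t_end: float) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    acc = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        if t_start <= t < t_end and 0 <= x < width and 0 <= y < height:
            acc[y][x] += polarity
    return acc


def overlay_events(rgb: Image, acc: List[List[int]]) -> Image:
    """Return a copy of rgb where pixels with net event activity are recoloured."""
    out = [list(row) for row in rgb]  # copy rows so the input frame is untouched
    for y, row in enumerate(acc):
        for x, value in enumerate(row):
            if value > 0:
                out[y][x] = POS_COLOR
            elif value < 0:
                out[y][x] = NEG_COLOR
    return out


# Usage: an all-black 4x3 frame (standing in for a clipped low-light image)
# and three events, the last of which falls outside the 100 ms window.
dark = [[(0, 0, 0)] * 4 for _ in range(3)]
events = [(1, 0, 0.010, +1), (2, 1, 0.020, -1), (3, 2, 0.500, +1)]
acc = accumulate_events(events, width=4, height=3, t_start=0.0, t_end=0.1)
fused = overlay_events(dark, acc)
```

Because the overlay has no learned parameters, the fused frame can in principle be fed to an image-only model unchanged; the choice of window length is the main design decision in such a sketch, which is consistent with the abstract's mention of studying event windowing.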

Related papers