Fugu-MT Paper Translation (Abstract): End-to-end Listen, Look, Speak and Act

Paper abstract: End-to-end Listen, Look, Speak and Act

arxiv url: http://arxiv.org/abs/2510.16756v1
Date: Sun, 19 Oct 2025 08:45:46 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.131274
Title: End-to-end Listen, Look, Speak and Act
Title (reference translation): End-to-end listening, looking, speaking, and acting
Authors: Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang
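Abstract summary: ELLSA is a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. At its core is SA-MoE (Self-Attention Mixture-of-Experts), which routes each modality to specialized experts and fuses them through a unified attention backbone.
Reference score (independently computed attention metric): 22.047534228540783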
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.
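Abstract (reference translation): Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and adapt fluidly to turn-taking and interruptions. Realizing these capabilities is essential for building models that simulate humans. ELLSA (End-to-end Listen, Look, Speak and Act) is, to our knowledge, the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, yielding more natural, human-like behavior. Its core, the SA-MoE architecture (Self-Attention Mixture-of-Experts), routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, rejection of defective instructions, speaking while acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA is a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code, and model checkpoints will be released upon acceptance.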
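Illustrative sketch (not from the paper): the abstract describes SA-MoE only at a high level, as routing each modality to a specialized expert and fusing the results through a unified attention backbone. The toy Python below shows one way that idea could look. Everything in it is an assumption made for illustration: the `EXPERTS` table, the affine "experts", the identity attention projections, the order of routing and attention, and the function names. It should not be read as ELLSA's actual implementation.

```python
import math

# Hypothetical toy setup: all names and numbers here are illustrative
# assumptions; none of them come from the ELLSA paper or its code.

def make_expert(scale, bias):
    """Return a toy per-modality 'expert': an element-wise affine map."""
    def expert(vec):
        return [scale * x + bias for x in vec]
    return expert

# One expert per modality named in the abstract (parameters are made up).
EXPERTS = {
    "speech": make_expert(1.0, 0.0),
    "vision": make_expert(0.5, 0.1),
    "text": make_expert(2.0, -0.1),
    "action": make_expert(1.5, 0.2),
}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_self_attention(vectors):
    """Single-head self-attention over all tokens, using identity
    query/key/value projections to keep the sketch short."""
    dim = len(vectors[0])
    fused = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ])
    return fused

def sa_moe_layer(tokens):
    """tokens: list of (modality, vector) pairs.

    Step 1: route each token to the expert for its modality.
    Step 2: fuse all expert outputs with one shared attention pass.
    """
    modalities = [modality for modality, _ in tokens]
    routed = [EXPERTS[modality](vec) for modality, vec in tokens]
    fused = shared_self_attention(routed)
    return list(zip(modalities, fused))

if __name__ == "__main__":
    mixed_sequence = [
        ("speech", [1.0, 0.0]),
        ("vision", [0.0, 1.0]),
        ("text", [0.5, 0.5]),
    ]
    for modality, vector in sa_moe_layer(mixed_sequence):
        print(modality, vector)
```

In this sketch the routing is deterministic by modality tag rather than learned, and only the attention step lets tokens from different modalities influence one another, which is the separation the abstract's wording suggests.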
