RoboEgo System Card: An Omnimodal Model with Native Full Duplexity
- URL: http://arxiv.org/abs/2506.01934v1
- Date: Mon, 02 Jun 2025 17:53:10 GMT
- Title: RoboEgo System Card: An Omnimodal Model with Native Full Duplexity
- Authors: Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Aixin Sun, Yequan Wang
- Abstract summary: RoboEgo (alias: FLM-Ego) is a unified model system designed to address both challenges. FLM-Ego incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms.
- Score: 48.52383812141669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling more than three modalities, such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming, visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness while maintaining content quality comparable to state-of-the-art semi-duplex omnimodal models, a feat previously considered unattainable by native full-duplex systems.
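To make the full-duplex claim concrete, here is a minimal sketch of how a natively full-duplex loop might interleave listening and speaking on a fixed time slice, where the worst-case reaction time to new input is one slice. The 80 ms slice and the encode_frame / next_token / speak callables are illustrative assumptions, not RoboEgo's actual interfaces.

```python
from collections import deque

SLICE_MS = 80  # duplex time slice; the theoretical floor on reaction latency

def duplex_loop(mic_frames, encode_frame, next_token, speak):
    """Interleave one input frame and one output step per 80 ms slice."""
    context = deque(maxlen=1024)             # rolling multimodal context
    for frame in mic_frames:                 # each frame covers SLICE_MS of audio
        context.append(encode_frame(frame))  # listen: fold the new input in
        token = next_token(context)          # speak: react within the same slice
        if token is not None:                # the model may choose silence
            speak(token)
        # worst-case reaction to a new instruction = one slice = 80 ms
```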
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything [12.274140974616747]
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs. We propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system.
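As a rough illustration of the master-agent idea (not Agent-Omni's actual API), a coordinator can route each input modality to a specialist model and let a text model fuse the results; all handler names below are hypothetical stand-ins.

```python
def master_agent(query, inputs, experts):
    """Route each input modality to a specialist, then fuse answers via a text LLM."""
    evidence = []
    for modality, payload in inputs.items():   # e.g. {"image": img, "audio": wav}
        handler = experts.get(modality)        # pick the matching foundation model
        if handler is not None:
            evidence.append(f"[{modality}] {handler(payload)}")
    # the master model reasons over text renderings of every modality
    prompt = "\n".join(evidence + [f"Question: {query}"])
    return experts["text"](prompt)
```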
arXiv Detail & Related papers (2025-11-04T18:59:09Z) - End-to-end Listen, Look, Speak and Act [22.047534228540783]
ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial intelligence. At its core is a novel SA-MoE (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone.
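A minimal sketch of the routing-then-fusion pattern described above, assuming per-modality expert FFNs followed by a shared attention pass; dimensions and module choices are illustrative, not ELLSA's actual architecture:

```python
import torch
import torch.nn as nn

class ModalityMoEBlock(nn.Module):
    """Route each modality to its own expert FFN, fuse via shared attention."""
    def __init__(self, d_model=512, n_heads=8, modalities=("audio", "vision", "text")):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in modalities
        })
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: dict[str, torch.Tensor]) -> torch.Tensor:
        # routing: each modality's tokens pass only through that modality's expert
        routed = [self.experts[m](x) for m, x in tokens.items()]
        # fusion: concatenate along the sequence axis, share one attention pass
        seq = self.norm(torch.cat(routed, dim=1))
        fused, _ = self.attn(seq, seq, seq)
        return fused
```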
arXiv Detail & Related papers (2025-10-19T08:45:46Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - Aya Vision: Advancing the Frontier of Multilingual Multimodality [15.981889066681424]
We develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data. We also propose a cross-modal model merging technique that mitigates catastrophic forgetting. Our work advances multilingual progress on the multimodal frontier and provides insights into techniques that effectively reduce the need for compute.
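Model merging of this kind is typically done in weight space; a minimal sketch, assuming simple linear interpolation between a base and a fine-tuned checkpoint (Aya Vision's actual merging recipe may differ):

```python
import torch

def merge_state_dicts(base, tuned, alpha=0.5):
    """Linear weight-space merge: alpha * tuned + (1 - alpha) * base."""
    return {name: alpha * tuned[name] + (1.0 - alpha) * w
            for name, w in base.items()}

# usage: model.load_state_dict(merge_state_dicts(base_sd, tuned_sd, alpha=0.6))
```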
arXiv Detail & Related papers (2025-05-13T17:03:48Z) - Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z) - Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [0.0]
Mini-Omni2 is a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries.
We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset.
arXiv Detail & Related papers (2024-10-15T02:10:45Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z) - AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework [21.10693332367192]
We present AllSpark, a multimodal spatio-temporal general artificial intelligence model. Our model integrates ten different modalities into a unified framework. Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks.
arXiv Detail & Related papers (2023-12-31T17:21:02Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
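A minimal sketch of such an early-fusion, single-stream encoder, assuming each modality is projected into a shared space and concatenated along the sequence axis; names and dimensions are illustrative, not UmURL's exact implementation:

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Project each modality to a shared space, then encode one joint stream."""
    def __init__(self, in_dims: dict[str, int], d_model: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in in_dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # early fusion: one concatenated token sequence, one shared encoder stream
        fused = torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)
        return self.encoder(fused)  # (batch, total_len, d_model)

# usage with hypothetical skeleton streams: joint, motion, and bone features
enc = EarlyFusionEncoder({"joint": 150, "motion": 150, "bone": 150})
```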
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR).
arXiv Detail & Related papers (2022-08-16T08:02:30Z)