ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
- URL: http://arxiv.org/abs/2601.10323v1
- Date: Thu, 15 Jan 2026 12:09:04 GMT
- Title: ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
- Authors: Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen,
- Abstract summary: We present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering.
- Score: 32.72568710955575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
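The sketch below is not the authors' released code; it only illustrates the two mechanisms named in the abstract: packing the stream into synchronized multimodal units that pair each discrete video frame with the dense audio covering its interval, and a lightweight speak head that decides whether to respond, separately from generating the response. The frame rate, audio token rate, hidden size, and all class and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the ROMA implementation) of:
# (1) synchronized multimodal units: one video frame + the dense audio of its interval,
# (2) a lightweight "speak head" that decouples response initiation from generation.
import torch
import torch.nn as nn

FRAME_RATE_HZ = 2          # assumed video sampling rate (frames/sec)
AUDIO_TOKENS_PER_SEC = 50  # assumed dense audio tokenizer rate

def pack_multimodal_units(video_frames, audio_tokens):
    """Group the stream into units: each frame is paired with the audio tokens
    that fall inside its time interval, handling the granularity mismatch."""
    tokens_per_frame = AUDIO_TOKENS_PER_SEC // FRAME_RATE_HZ
    units = []
    for i, frame in enumerate(video_frames):
        audio_slice = audio_tokens[i * tokens_per_frame : (i + 1) * tokens_per_frame]
        units.append({"frame": frame, "audio": audio_slice})
    return units

class SpeakHead(nn.Module):
    """Tiny binary classifier over the backbone's latest hidden state:
    'stay silent' vs. 'respond now'. Keeping this decision separate from
    text generation avoids the task conflict mentioned in the abstract."""
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 2)  # logits: [silent, speak]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size]; decide at the newest unit.
        return self.proj(hidden_states[:, -1])
```

A usage loop would encode each incoming unit, run the speak head on the newest hidden state, and only call the generator when the "speak" logit wins, so triggering and response generation stay decoupled.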
Related papers
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue. It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z) - TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech [11.020614074201346]
We introduce TANDEM, a unified framework that transforms audio-visual hate detection into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other. TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM.
arXiv Detail & Related papers (2026-01-16T10:52:12Z) - OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding [23.176694412214157]
We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
arXiv Detail & Related papers (2025-12-29T17:59:05Z) - LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [35.01134463094784]
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. Existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap.
arXiv Detail & Related papers (2025-12-29T16:17:36Z) - MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation [59.23161833385837]
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
arXiv Detail & Related papers (2025-12-02T18:55:53Z) - OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation [52.579531290307926]
This paper introduces OmniMotion-X, a versatile framework for whole-body human motion generation. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, and speech-to-gesture. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
arXiv Detail & Related papers (2025-10-22T17:25:33Z) - GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval [12.483734449829235]
GAID is a framework that integrates audio and visual features under textual guidance. DASP injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. Experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results with notable efficiency gains.
arXiv Detail & Related papers (2025-08-03T10:44:24Z) - AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)