Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
- URL: http://arxiv.org/abs/2503.01879v2
- Date: Fri, 07 Mar 2025 09:21:40 GMT
- Title: Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
- Authors: Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Haohan Li, Yu Lu, Shilin Zhou, Yue Lu, Ziliang Gan, Ziao Wang, Junwei Liao, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai
- Abstract summary: We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding, and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario speech datasets?
- Score: 50.23246260804145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. The journey towards achieving Artificial General Intelligence (AGI) necessitates the development of models that can emulate these multifaceted perceptual capabilities and comprehensively understand such diverse data. To this end, we introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data in any combination and outputting audio or text in an end-to-end way. We systematically investigate Nexus-O by addressing three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding, and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario speech datasets? For the first question, we design and pre-train Nexus-O on top of a vision-language model rather than a language model; by pre-training over high-quality synthetic audio data, the model becomes capable of tri-modal perception and interaction. For the second question, we introduce a new audio testbed, Nexus-O-audio, comprising diverse Automatic Speech Recognition (ASR) samples that span various real-world scenarios, such as corporate meetings and live streams. For the third question, we design a speech data synthesis pipeline to obtain high-quality speech training datasets covering various real-world scenarios. Comprehensive experimentation and an in-depth analysis of tri-modal alignment over the latent space demonstrate the advantages of our model on downstream tasks.
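The Nexus-O-audio testbed is only described at a high level here, but ASR benchmarks of this kind are conventionally scored with word error rate (WER): word-level edit distance normalized by the number of reference words. The sketch below is illustrative only, not the paper's evaluation code; the `edit_distance` helper and the toy `testbed` pairs are assumptions standing in for real testbed data.

```python
# Minimal sketch of scoring an ASR testbed such as Nexus-O-audio:
# corpus-level WER = total word-level edit distance / total reference words.
# Illustrative only; not the paper's evaluation code.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # match / substitution
    return dp[len(ref)][len(hyp)]

# Hypothetical (reference transcript, model hypothesis) pairs standing in for
# real-world scenarios such as corporate meetings and live streams.
testbed = [
    ("please share the quarterly numbers before the meeting",
     "please share the quarterly number before the meeting"),
    ("welcome everyone to today's live stream",
     "welcome everyone to today's live stream"),
]

errors, ref_words = 0, 0
for reference, hypothesis in testbed:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    errors += edit_distance(ref, hyp)
    ref_words += len(ref)
print(f"corpus WER: {errors / ref_words:.3f}")
```

Real evaluations also apply text normalization (casing, punctuation, number formats) before scoring, which can shift WER noticeably across scenarios.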
Related papers
- Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment [88.72389428177942]
Ola is an omni-modal language model that achieves competitive performance across image, video, and audio understanding.
We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis [68.73476738779628]
OpenOmni is a two-stage training framework that integrates omnimodal alignment and speech generation.
It surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks.
OpenOmni achieves real-time speech generation with 1s latency in non-autoregressive mode.
arXiv Detail & Related papers (2025-01-08T15:18:09Z) - ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks.
We propose our best model, dubbed Elucidated Text-To-Audio (ETTA).
ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z) - CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens (a generic FSQ sketch appears after this list).
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z) - SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation of how well systems distinguish cutting-edge AI-synthesized auditory content from genuine audio.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications [20.842799581850617]
We consider the task of animating 3D facial geometry from a speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one mapping from the speech signal to 3D face meshes on small datasets with limited speakers.
arXiv Detail & Related papers (2023-11-30T01:14:43Z) - FALL-E: A Foley Sound Synthesis Model and Strategies [0.5599792629509229]
The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder.
We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input.
arXiv Detail & Related papers (2023-06-16T12:44:10Z) - VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [34.38377548121313]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships among vision, audio, and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
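The CosyVoice 2 entry above mentions finite-scalar quantization (FSQ) for improving codebook utilization of speech tokens. The sketch below illustrates the generic FSQ idea only: each latent dimension is bounded and rounded to a few levels, so the implicit codebook is the Cartesian product of per-dimension levels. The level counts are assumptions, this is not CosyVoice 2's actual tokenizer, and the training-time straight-through gradient is omitted.

```python
import numpy as np

# Generic finite-scalar quantization (FSQ) sketch: squash each latent
# dimension into a bounded range and round it to one of a few integer levels.
# The implicit codebook (product of per-dimension levels) tends to be used
# far more evenly than a learned VQ codebook. Level counts are illustrative;
# odd counts keep rounding symmetric around zero (even counts need an extra
# half-step offset, omitted here for brevity).

LEVELS = np.array([7, 5, 5, 5])           # codebook size = 7*5*5*5 = 875 tokens

def fsq_quantize(z: np.ndarray) -> np.ndarray:
    """Map latents of shape (..., len(LEVELS)) onto the integer grid."""
    half = (LEVELS - 1) / 2.0              # e.g. 5 levels -> values in {-2,...,2}
    bounded = np.tanh(z) * half            # squash each dimension into (-half, half)
    return np.round(bounded)               # nearest allowed level (STE during training)

def fsq_token_id(q: np.ndarray) -> np.ndarray:
    """Flatten a quantized vector into a single token id in [0, prod(LEVELS))."""
    digits = (q + (LEVELS - 1) / 2.0).astype(int)            # shift to 0..L_i-1
    bases = np.cumprod(np.concatenate(([1], LEVELS[:-1])))   # mixed-radix weights
    return (digits * bases).sum(axis=-1)

frames = np.random.randn(4, len(LEVELS))   # stand-in for frame-level speech latents
tokens = fsq_token_id(fsq_quantize(frames))
print("speech token ids:", tokens)
```

CosyVoice 2's streaming synthesis and chunk-aware causal flow matching are separate components and are not shown here.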