OmniGAIA: Towards Native Omni-Modal AI Agents
- URL: http://arxiv.org/abs/2602.22897v2
- Date: Sat, 28 Feb 2026 12:27:12 GMT
- Title: OmniGAIA: Towards Native Omni-Modal AI Agents
- Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou,
- Abstract summary: We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
- Score: 103.79729735478924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
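The abstract does not spell out how the tool-integrated reasoning loop with active omni-modal perception is implemented. As a rough, hypothetical illustration of that paradigm, the sketch below interleaves model "reasoning" steps with explicit tool calls over audio, video, and image inputs; the tool names, the `policy` stub, and the message format are illustrative assumptions, not OmniAtlas's actual interface.

```python
# Hypothetical sketch of a tool-integrated reasoning loop with active
# omni-modal perception. Tool names and the policy stub are illustrative
# assumptions, not the interface described in the OmniGAIA/OmniAtlas paper.
from typing import Callable, Dict, List

# Stub perception tools the agent may actively invoke.
TOOLS: Dict[str, Callable[[str], str]] = {
    "sample_video_frames": lambda arg: f"[frames extracted from {arg}]",
    "transcribe_audio":    lambda arg: f"[transcript of {arg}]",
    "caption_image":       lambda arg: f"[caption of {arg}]",
}

def policy(history: List[dict]) -> dict:
    """Placeholder for the agent model: maps the dialogue/tool history to
    either a tool call or a final answer. A real agent would query an LLM."""
    if not any(m["role"] == "tool" for m in history):
        return {"action": "tool", "name": "transcribe_audio", "arg": "clip.wav"}
    return {"action": "answer", "content": "final answer grounded in tool results"}

def run_agent(question: str, max_turns: int = 8) -> str:
    history: List[dict] = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = policy(history)
        if step["action"] == "answer":          # reasoning concluded
            return step["content"]
        tool = TOOLS[step["name"]]              # active perception via a tool call
        observation = tool(step["arg"])
        history.append({"role": "tool", "name": step["name"], "content": observation})
    return "no answer within turn budget"

if __name__ == "__main__":
    print(run_agent("What does the speaker say while the chart is shown?"))
```

In a multi-hop setting such as the OmniGAIA tasks described above, such a loop would typically issue several calls across modalities before committing to an answer.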
Related papers
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding [23.176694412214157]
We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
arXiv Detail & Related papers (2025-12-29T17:59:05Z)
- OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM [146.029449832893]
We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings (a rough sketch of the rotary time-embedding idea follows this list).
arXiv Detail & Related papers (2025-10-17T17:59:59Z)
- OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation [49.66156306240961]
We present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action backbone and trains with three primary goal modalities. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks.
arXiv Detail & Related papers (2025-09-23T18:40:29Z)
- Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z)
- OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
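The listing does not explain how OmniVinci's Constrained Rotary Time Embedding works internally. The sketch below, referenced from the OmniVinci entry above, only illustrates the generic rotary-embedding idea of encoding an absolute timestamp as a rotation of feature pairs; the frequency schedule, the clipping of timestamps to a fixed range (one possible reading of "constrained"), and all parameter values are assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch: encoding an absolute timestamp into an embedding by
# rotating consecutive feature pairs (standard rotary-embedding machinery).
# This illustrates the general idea only; OmniVinci's actual Constrained
# Rotary Time Embedding is not specified in the listing above.
import numpy as np

def rotary_time_embed(x: np.ndarray, t_seconds: float,
                      max_time: float = 300.0, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of `x` by angles proportional to the
    (clipped) absolute timestamp. `max_time` bounds the timestamp range,
    which is an assumed interpretation of "constrained"."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even to form rotation pairs"
    t = np.clip(t_seconds, 0.0, max_time)                 # assumed constraint
    freqs = base ** (-np.arange(0, d, 2) / d)             # one frequency per pair
    angles = t * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # split into pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                  # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: the same feature vector stamped at two different absolute times.
feat = np.random.default_rng(0).standard_normal(16)
early, late = rotary_time_embed(feat, 1.5), rotary_time_embed(feat, 120.0)
```

Because rotations preserve vector norms, the timestamp shifts the phase of each feature pair without changing its magnitude, which is what lets downstream attention compare tokens by relative time.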