ChronusOmni: Improving Time Awareness of Omni Large Language Models
- URL: http://arxiv.org/abs/2512.09841v1
- Date: Wed, 10 Dec 2025 17:22:42 GMT
- Title: ChronusOmni: Improving Time Awareness of Omni Large Language Models
- Authors: Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru,
- Abstract summary: Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. We propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. We construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task.
- Score: 29.685563616290352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV, with more than 30% improvement, and top results on most metrics of other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
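The abstract names two mechanisms: interleaving text-based timestamp tokens with visual and audio representations at each time unit, and reinforcement learning with reward functions that enforce correct temporal ordering. The paper does not provide implementation details, so the sketch below is a hypothetical illustration under assumed conventions: `<t=k>` strings stand in for timestamp tokens, time units are one second, and a temporal-IoU reward stands in for the paper's specially designed reward functions; none of these names or choices come from the source.

```python
# Hypothetical sketch (not the authors' code): (1) interleave timestamp text
# tokens with per-second visual and audio features, (2) a temporal-IoU reward
# of the kind an RL stage might use to score a predicted event span.
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class ModalityFeature:
    """Placeholder for a visual or audio embedding covering one time unit."""
    modality: str            # "video" or "audio"
    second: int              # which 1-second unit this feature covers
    vector: Sequence[float]  # the embedding itself


def interleave_with_timestamps(
    video_feats: List[ModalityFeature],
    audio_feats: List[ModalityFeature],
    num_seconds: int,
) -> List[Union[str, ModalityFeature]]:
    """Build a unified sequence: a "<t=k>" timestamp token followed by that
    second's video and audio features, for every time unit k. A real model
    would map the timestamp strings to text token ids."""
    by_key = {("video", f.second): f for f in video_feats}
    by_key.update({("audio", f.second): f for f in audio_feats})

    sequence: List[Union[str, ModalityFeature]] = []
    for k in range(num_seconds):
        sequence.append(f"<t={k}>")            # text-based timestamp token
        for modality in ("video", "audio"):
            feat = by_key.get((modality, k))
            if feat is not None:
                sequence.append(feat)          # modality embedding for second k
    return sequence


def temporal_iou_reward(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Reward = IoU between predicted and reference (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    vid = [ModalityFeature("video", s, [0.0]) for s in range(4)]
    aud = [ModalityFeature("audio", s, [0.0]) for s in range(4)]
    seq = interleave_with_timestamps(vid, aud, num_seconds=4)
    print([x if isinstance(x, str) else x.modality for x in seq])
    print(temporal_iou_reward((2.0, 5.0), (3.0, 6.0)))   # -> 0.5
```

The interleaving keeps every modality's features adjacent to an explicit textual time marker, which is one plausible way to realize the "unified temporal modeling across modalities" the abstract describes; the reward is only an illustrative choice.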
Related papers
- Factorized Learning for Temporally Grounded Video-Language Models [81.13591807802652]
Two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy. We propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency.
arXiv Detail & Related papers (2025-12-30T09:13:20Z) - Not in Sync: Unveiling Temporal Bias in Audio Chat Models [59.146710538620816]
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction.
arXiv Detail & Related papers (2025-10-14T06:29:40Z) - Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z) - Game-Time: Evaluating Temporal Dynamics in Spoken Language Models [93.844257719952]
We introduce the Game-Time Benchmark framework to assess temporal capabilities. Our evaluation of diverse SLMs reveals a clear performance disparity. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI.
arXiv Detail & Related papers (2025-09-30T15:23:39Z) - Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times [0.0]
We introduce the Perfect Times dataset, a quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video.
arXiv Detail & Related papers (2025-06-01T09:45:41Z) - On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models [27.280311932711847]
We present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.
We first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects.
We generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect.
arXiv Detail & Related papers (2023-11-29T07:15:34Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Temporal and cross-modal attention for audio-visual zero-shot learning [38.02396786726476]
Generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information.
We propose a multi-modal and Temporal Cross-attention Framework for audio-visual generalised zero-shot learning.
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
arXiv Detail & Related papers (2022-07-20T15:19:30Z) - AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces [28.245662058349854]
In this paper, we explore if audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces.
We use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision.
Our model can utilize any of the available modalities for task-specific inference.
arXiv Detail & Related papers (2022-07-07T02:23:02Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)