Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
- URL: http://arxiv.org/abs/2603.02655v1
- Date: Tue, 03 Mar 2026 06:39:04 GMT
- Title: Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
- Authors: Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki
- Abstract summary: We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone.
- Score: 69.57389826203699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
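The dynamic strategy, as described, schedules the next prediction using the estimated spoken duration of the previous utterance. Below is a minimal Python sketch of both decoding loops; `generate_commentary` stands in for an unspecified MLLM prompting call, and the characters-per-second speaking rate is an illustrative assumption, not a value from the paper.

```python
import time

FIXED_INTERVAL = 3.0    # seconds between utterances (fixed-interval baseline)
CHARS_PER_SECOND = 8.0  # assumed speaking rate used to estimate duration

def estimate_duration(utterance: str) -> float:
    """Rough spoken duration of an utterance, estimated by character count."""
    return len(utterance) / CHARS_PER_SECOND

def run_commentary(get_frames, generate_commentary, dynamic=True, steps=10):
    """Emit commentary with pauses between utterances.

    get_frames() returns the latest video frames; generate_commentary()
    prompts the MLLM with them and returns one utterance string.
    """
    for _ in range(steps):
        utterance = generate_commentary(get_frames())
        print(utterance)
        # Dynamic: pause for the estimated spoken duration of the utterance
        # just produced; fixed: always pause the same interval.
        time.sleep(estimate_duration(utterance) if dynamic else FIXED_INTERVAL)
```

The difference between the two strategies is confined to the single pause expression, which matches the abstract's framing of timing as a decoding-time decision requiring no fine-tuning.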
Related papers
- Commentary Generation for Soccer Highlights [0.0]
We extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup. Our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance.
arXiv Detail & Related papers (2025-08-11T01:48:37Z)
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
- TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation [13.835968474349034]
TimeSoccer is the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling. MoFA-Select is a training-free, motion-aware frame compression module that adaptively selects representative frames.
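A minimal, training-free sketch of motion-aware frame selection in the spirit of MoFA-Select; the abstract does not specify the module's scoring, so mean absolute frame difference serves here as an assumed motion proxy.

```python
import numpy as np

def select_frames(frames: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` frames with the largest motion scores.

    frames: array of shape (T, H, W, C); returns indices in temporal order.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    motion = diffs.mean(axis=(1, 2, 3))             # per-transition motion
    scores = np.concatenate([[motion[0]], motion])  # give frame 0 a score too
    return np.sort(np.argsort(scores)[-budget:])    # top-k, temporal order
```

Keeping the budget fixed lets a long full-match video be compressed to a token count the MLLM can process, while the motion scores bias selection toward eventful frames.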
arXiv Detail & Related papers (2025-04-24T08:27:42Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and offers limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
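One plausible reading of this recipe is to caption each candidate segment with an MLLM and add the resulting pairs to the retrieval training set. A hypothetical sketch, where `mllm_caption` stands in for an unspecified captioning call rather than the paper's actual interface:

```python
def narrate_video(clips, mllm_caption):
    """Generate one pseudo-caption per clip to enrich VMR supervision.

    clips: iterable of (start, end, frames) tuples; returns (caption, span)
    pairs that can be added to the moment-retrieval training data.
    """
    pairs = []
    for start, end, frames in clips:
        caption = mllm_caption(frames)  # plausible description of the clip
        pairs.append((caption, (start, end)))
    return pairs
```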
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieve state-of-the-art results in video recognition, video-speech, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
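Of the three unified objectives, the cross-modal contrastive term is the easiest to make concrete. A NumPy sketch of a symmetric InfoNCE loss over paired video/text embeddings; the temperature value is an illustrative assumption, not the paper's configuration.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # subtract row max for stability
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(video_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of the
    batch similarity matrix and are pulled above all mismatched pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau                  # (B, B) cosine similarities
    diag = np.arange(len(logits))
    loss_v2t = -log_softmax(logits)[diag, diag].mean()
    loss_t2v = -log_softmax(logits.T)[diag, diag].mean()
    return (loss_v2t + loss_t2v) / 2
```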
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
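Boundary prediction here can be pictured as a span search over per-frame text-video similarities. A simplified sketch that localizes the best contiguous span with Kadane's maximum-subarray algorithm; the per-frame scores and the bias threshold are assumptions for illustration, not the paper's method.

```python
def localize_moment(scores, bias=0.5):
    """Return inclusive (start, end) frame indices of the span maximizing
    sum(score - bias), via Kadane's maximum-subarray algorithm."""
    best, best_sum = (0, 0), float("-inf")
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        cur_sum += s - bias
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
        if cur_sum < 0:          # a negative prefix never helps; restart
            cur_sum, cur_start = 0.0, i + 1
    return best
```

For example, `localize_moment([0.2, 0.9, 0.8, 0.1])` returns `(1, 2)`, the two frames whose similarity to the query clears the bias.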
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)