Moment and Highlight Detection via MLLM Frame Segmentation
- URL: http://arxiv.org/abs/2512.12246v1
- Date: Sat, 13 Dec 2025 09:11:59 GMT
- Title: Moment and Highlight Detection via MLLM Frame Segmentation
- Authors: I Putu Andika Bagas Jiwanta, Ayu Purwarianti,
- Abstract summary: We propose a novel method to detect video moments and highlights from natural-language queries.<n>Our method scores above the baseline (35.28 MAP) for moment retrieval.
- Score: 3.7199696376626457
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Detecting video moments and highlights from natural-language queries have been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that enforces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
Related papers
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs [88.68484904214142]
We introduce Patch-as-Decodable Token (PaDT) to generate both textual and diverse visual outputs.<n>Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images.<n>We show PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models.
arXiv Detail & Related papers (2025-10-02T12:23:57Z) - Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding [47.400649582392255]
We use large language models (MLLMs) to explore a zero-shot solution in STVG.<n>We propose a MLLM-based zero-shot framework for STVG, which includes novel temporal-augmented assembling strategies.
arXiv Detail & Related papers (2025-09-18T17:35:50Z) - END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks.<n>They are often distracted by irrelevant or noisy context in input sequences that degrades output quality.<n>We introduce Early Noise Dropping (textscEND), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - Efficient Real-time Refinement of Language Model Text Generation [65.1937138219008]
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks.<n>A critical challenge remains in that they sometimes generate factually incorrect answers.<n>We propose Streaming-VR, a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs.
arXiv Detail & Related papers (2025-01-14T03:59:48Z) - SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) [16.673210422615348]
More than 10 new methods have been proposed to perform Membership Inference Attacks (MIAs) against LLMs.<n>Contrary to traditional MIAs which rely on fixed-but randomized-records or models, these methods are mostly trained and tested on datasets collected post-hoc.<n>This lack of randomization raises concerns of a distribution shift between members and non-members.
arXiv Detail & Related papers (2024-06-25T23:12:07Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.<n>We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning [59.88924847995279]
We propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF.<n>To reduce the distribution discrepancy, we develop the cross-modal match module.<n>CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks.
arXiv Detail & Related papers (2024-03-12T04:04:38Z) - Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning [23.932500424117244]
In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs)
Previous studies have shown that using LLMs' outputs as labels is effective in training models to select demonstrations.
This paper presents an analysis on different utility functions by focusing on LLMs' output probability given ground-truth output.
arXiv Detail & Related papers (2023-11-16T07:03:54Z) - LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the objective LM, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z) - Transcormer: Transformer for Sentence Scoring with Sliding Language
Modeling [95.9542389945259]
Sentence scoring aims at measuring the likelihood of a sentence and is widely used in many natural language processing scenarios.
We propose textitTranscormer -- a Transformer model with a novel textitsliding language modeling (SLM) for sentence scoring.
arXiv Detail & Related papers (2022-05-25T18:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.