LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
- URL: http://arxiv.org/abs/2510.17305v2
- Date: Tue, 21 Oct 2025 10:16:53 GMT
- Title: LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
- Authors: ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang
- Abstract summary: We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos. Our benchmark excels in three key areas: long-duration, information-dense videos; diverse and challenging task scenarios; and rigorous, comprehensive quality-assurance pipelines.
- Score: 19.03169157546538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from the open-source FineVideo dataset based on duration and the information density of both the visual and audio modalities, focusing on content such as lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We design six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We develop a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we design a series of experiments. Experimental results show that omni-modal models (OLMs) still struggle with tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in OLMs' multi-modal fusion. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.
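As a rough illustration of the selection criterion in (a), the sketch below keeps only videos that exceed a duration threshold and whose audio and visual tracks are both information-dense. The metadata field names and threshold values are illustrative assumptions, not the paper's published cutoffs.

```python
# Hypothetical sketch of the video-selection step: keep long videos whose
# visual and audio modalities are both information-dense. Field names and
# thresholds are assumptions for illustration, not the authors' values.

def information_density(token_count: int, duration_s: float) -> float:
    """Tokens per second of video (e.g., transcript words or caption tokens)."""
    return token_count / max(duration_s, 1.0)

def select_videos(candidates, min_duration_s=600.0,
                  min_audio_density=1.0, min_visual_density=0.5):
    """Filter FineVideo-style metadata dicts by duration and density."""
    selected = []
    for video in candidates:
        if video["duration_s"] < min_duration_s:
            continue  # not a long video under this (assumed) threshold
        audio = information_density(video["transcript_tokens"], video["duration_s"])
        visual = information_density(video["caption_tokens"], video["duration_s"])
        if audio >= min_audio_density and visual >= min_visual_density:
            selected.append(video)  # dense in both modalities: keep it
    return selected
```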
Related papers
- A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos [76.98722001848493]
LongShOTBench is a diagnostic benchmark for long-form multimodal video understanding.
It includes open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use.
LongShOTAgent is an agentic system that analyzes long videos via preprocessing, search, and iterative refinement.
arXiv Detail & Related papers (2025-12-18T18:59:27Z)
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought.
We exploit LMMs' inherent temporal grounding ability as a native video-cropping tool to zoom in on a specific video clip and resample finer-grained video frames.
Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
arXiv Detail & Related papers (2025-11-25T19:22:48Z)
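The "native tool calling" loop LongVT describes can be pictured as the sketch below, in which a model alternates between reasoning and calling a video-cropping tool to resample a clip at a higher frame rate. The `Step` structure, `generate` interface, and sampling rates are assumptions for illustration, not LongVT's actual implementation.

```python
# Minimal sketch of an interleaved think / call-tool / observe loop in the
# spirit of LongVT's Multimodal Chain-of-Tool-Thought. Everything here
# (model interface, tool schema, frame rates) is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    text: str                        # the model's reasoning for this round
    crop: Optional[tuple] = None     # (start_s, end_s) if it wants to zoom in

def answer_with_tools(
    generate: Callable[[list], Step],                       # assumed LMM call
    sample_frames: Callable[[float, float, float], list],   # (start, end, fps)
    duration_s: float,
    question: str,
    max_rounds: int = 4,
) -> str:
    # First pass: sparse frames over the whole video.
    context = [question] + sample_frames(0.0, duration_s, 0.1)
    for _ in range(max_rounds):
        step = generate(context)
        if step.crop is None:          # no tool call: model answers directly
            return step.text
        start_s, end_s = step.crop     # zoom in: resample the clip densely
        context += [step.text] + sample_frames(start_s, end_s, 2.0)
    return generate(context + ["Answer now."]).text
```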
- TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes).
TextVidBench makes three key contributions: it spans 9 categories (e.g., news, sports, gaming), with an average video length of 2,306 seconds, enabling more realistic evaluation of long-video understanding.
We propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on
arXiv Detail & Related papers (2025-06-05T12:54:56Z)
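Point (ii) above, non-uniform positional encoding, is commonly realized by interpolating position indices so that a long sequence is squeezed into the position range seen during training. The sketch below shows that generic idea for rotary embeddings (RoPE); the paper's IT-Rope mechanism is not reproduced here, and the sequence lengths are made-up examples.

```python
# Generic position-interpolation sketch for RoPE-style encodings; this is a
# common long-context trick, not TextVidBench's IT-Rope implementation.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Per-position rotation angles: shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

seq_len, trained_len, dim = 8192, 2048, 64          # assumed lengths
uniform = np.arange(seq_len)                        # positions 0..8191
interpolated = uniform * (trained_len / seq_len)    # squeeze into 0..2047
angles = rope_angles(interpolated, dim)             # stays in trained range
```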
- MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval [61.414236415351446]
We propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR).
MomentSeeker is built from long and diverse videos, averaging over 1,200 seconds in duration.
It covers a variety of real-world scenarios at three levels: global-level, event-level, and object-level, spanning common tasks such as action recognition, object localization, and causal reasoning.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
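Moment retrieval is usually scored by temporal IoU between the predicted and ground-truth intervals; a minimal implementation of that standard metric follows (this is the generic formula, not necessarily MomentSeeker's exact scoring code).

```python
# Temporal IoU: overlap / union of a predicted and a ground-truth interval.

def temporal_iou(pred, gt):
    """pred and gt are (start_s, end_s) tuples; returns a score in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g., a prediction shifted by 10 s on a 60 s moment:
print(temporal_iou((100.0, 160.0), (110.0, 170.0)))  # 50/70 ~= 0.714
```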
- X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding [25.85614872348223]
Long-form egocentric video understanding provides rich contextual information and insights into long-term human behaviors.
Existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes.
We introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings.
arXiv Detail & Related papers (2025-01-12T15:07:03Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering [49.68215536040896]
LvBench is a long-form video understanding benchmark for versatile multi-modal question answering.
We consider videos ranging from 70 seconds to 4 hours, covering single-scene, multi-scene, and full-scene contexts.
Our dataset comprises 20,061 question-answer pairs sourced from 100 carefully selected movies.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
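The frame-selection gating idea can be sketched as a learned scalar gate per frame that scales how strongly each frame's features reach the downstream classifier. Below is a generic PyTorch formulation under that assumption, not the paper's exact architecture.

```python
# Generic frame-selection gate for VideoQA: score each frame's relevance and
# down-weight the rest. A sketch of the gating idea, not the paper's model.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one relevance logit per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim)
        gate = torch.sigmoid(self.scorer(frame_feats))  # (batch, num_frames, 1)
        return frame_feats * gate  # temporally irrelevant frames are damped

gated = FrameSelectionGate(dim=512)(torch.randn(2, 30, 512))
```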