Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
- URL: http://arxiv.org/abs/2504.14693v1
- Date: Sun, 20 Apr 2025 17:58:46 GMT
- Title: Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
- Authors: Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang
- Abstract summary: We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures.
- Score: 27.487587901232057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding multi-discipline lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the choice of underlying large language model influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
Related papers
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos [44.36644075780221]
Video-MMMU is a benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing.
arXiv Detail & Related papers (2025-01-23T16:51:47Z) - Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges [15.850548556536538]
Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language.
An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities.
This survey provides a comprehensive overview of the recent advancements in LLMs.
arXiv Detail & Related papers (2024-12-04T11:14:06Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, demonstrating their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding [52.696422425058245]
Multimodal Large Language Models (MLLMs) with visual encoders have recently shown promising performance in visual understanding tasks. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding.
arXiv Detail & Related papers (2024-09-27T17:38:36Z) - MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z) - Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability to perform open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, covering 180+ hours of video and 9,000+ slides from 10 lecturers across various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)