Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation
- URL: http://arxiv.org/abs/2512.21402v1
- Date: Wed, 24 Dec 2025 19:43:59 GMT
- Title: Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation
- Authors: Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, Dhruv Kumar
- Abstract summary: VideoScore-2 does not capture how specific audiovisual attributes drive real audience engagement. We propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features. Our approach advances toward robust and explainable video understanding.
- Score: 8.15791379444665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
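The abstract outlines a three-stage pipeline: VLM-based feature extraction, clustering of those features into interpretable factors, and a lightweight regression-based evaluator that predicts engagement. Below is a minimal sketch of the latter two stages, assuming the per-video VLM features are already available as fixed-length vectors; the column-wise KMeans grouping, the gradient-boosted regressor, and all names and data are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the pipeline described in the abstract:
# (1) cluster VLM-derived feature dimensions into a few interpretable factors,
# (2) fit a regression-based evaluator on the factor scores,
# (3) check predicted-vs-actual engagement correlation.
# All data and model choices here are placeholders for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder inputs: one row per video of VLM-derived audiovisual features,
# plus an engagement label (e.g., normalized views or likes).
features = rng.normal(size=(500, 64))   # stand-in for VLM feature vectors
engagement = rng.random(500)            # stand-in for engagement scores

# (1) Group feature dimensions into interpretable factors by clustering the
# columns, then average each cluster into a single factor score per video.
n_factors = 8
col_clusters = KMeans(n_clusters=n_factors, random_state=0).fit_predict(features.T)
factors = np.column_stack(
    [features[:, col_clusters == k].mean(axis=1) for k in range(n_factors)]
)

# (2) Train a lightweight regression-based evaluator on the factor scores.
X_train, X_test, y_train, y_test = train_test_split(
    factors, engagement, test_size=0.2, random_state=0
)
evaluator = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# (3) Report rank correlation between predicted and actual engagement,
# analogous to the correlations reported in the experiments.
rho, _ = spearmanr(evaluator.predict(X_test), y_test)
print(f"Spearman correlation (predicted vs. actual engagement): {rho:.3f}")
```

Rank correlation is used here only as a stand-in for the engagement-correlation experiments mentioned in the abstract; inspecting feature importance on the factor scores is what would ground the interpretability claim.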
Related papers
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z) - LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content [0.0]
The Learned User Significance Tracker (LUST) is a framework designed to analyze video content and quantify the thematic relevance of its segments. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). The LUST framework aims to provide a nuanced, temporally-aware measure of user-defined significance, outputting an annotated video with visualized relevance scores and comprehensive analytical logs.
arXiv Detail & Related papers (2025-08-06T11:48:51Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs [18.07249962240035]
Video large language models (Video-LLMs) are increasingly integrated into real-world applications that demand grounded multimodal reasoning. Sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. We propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs.
arXiv Detail & Related papers (2025-06-08T15:00:21Z) - FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning [15.363132825156477]
We introduce FIOVA, a human-centric benchmark tailored for the evaluation of large vision-language models (LVLMs). It comprises 3,002 real-world videos (about 33.6s each), each annotated independently by five annotators. We propose FIOVA-DQ, an event-level evaluation metric that incorporates cognitive weights derived from annotator consensus.
arXiv Detail & Related papers (2024-10-20T03:59:54Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
Due to limited evaluation data resources, it is unclear whether the models really understand the visual scene and underlying commonsense knowledge.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z) - Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z) - Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)