Related papers: An Experimental Study on Generating Plausible Textual Explanations for Video Summarization

URL: http://arxiv.org/abs/2509.26225v1
Date: Tue, 30 Sep 2025 13:23:40 GMT
Title: An Experimental Study on Generating Plausible Textual Explanations for Video Summarization
Authors: Thomas Eleftheriadis, Evlampios Apostolidis, Vasileios Mezaris
Abstract summary: We extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model. We focus on one of the most desired characteristics for explainable AI, the plausibility of the obtained explanations. We conduct an experimental study using a SOTA method and two datasets for video summarization, to examine whether the more faithful explanations are also the more plausible ones.
Score: 5.531123091747035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Next, we focus on one of the most desired characteristics for explainable AI, the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and to identify the most appropriate approach for generating plausible textual explanations for video summarization.
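The plausibility evaluation described in the abstract compares the textual description of a visual explanation with the textual description of the video summary in an embedding space. The following is a minimal sketch of that idea, not the authors' implementation. Two things here are assumptions: cosine similarity as the overlap measure (the abstract does not name the measure), and the hypothetical `toy_embed` function, a bag-of-words stand-in used so the example runs without downloading a model. In practice, a sentence encoder such as SBERT or SimCSE would be passed as `embed` instead. The example sentences are invented for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder (e.g. SBERT, SimCSE).

    Returns a sparse bag-of-words vector as a Counter of token counts.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description has no overlap with anything
    return dot / (norm_a * norm_b)


def plausibility_score(explanation_text, summary_text, embed=toy_embed):
    """Assumed plausibility proxy: similarity of the two descriptions."""
    return cosine_similarity(embed(explanation_text), embed(summary_text))


# Invented example descriptions (not taken from the paper or its datasets).
summary = "a man rides a bike down a mountain trail"
explanation_a = "a man rides a bike on a trail"
explanation_b = "a chef cooks pasta in a kitchen"

score_a = plausibility_score(explanation_a, summary)
score_b = plausibility_score(explanation_b, summary)
print(f"explanation A: {score_a:.3f}, explanation B: {score_b:.3f}")
```

Under this sketch, an explanation whose description shares more content with the summary description receives a higher score. Swapping `toy_embed` for a dense encoder would only require `cosine_similarity` to operate on dense vectors instead of Counters.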

Related papers

MetaExplainer: A Framework to Generate Multi-Type User-Centered Explanations for AI Systems [1.9811010456089264]
We introduce MetaExplainer, a neuro-symbolic framework designed to generate user-centered explanations. Our approach employs a three-stage process: first, we decompose user questions into machine-readable formats using state-of-the-art large language models (LLMs); second, we delegate the task of generating system recommendations to model explainer methods; and finally, we synthesize natural language explanations that summarize the explainer outputs.
arXiv Detail & Related papers (2025-08-01T04:01:40Z)
VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
Explanatory Summarization with Discourse-Driven Planning [58.449423507036414]
We present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality.
arXiv Detail & Related papers (2025-04-27T19:47:36Z)
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. Video-LLMs struggle with compositional reasoning that requires multi-step explicit-temporal inference across object relations, interactions and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling. Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
An Integrated Framework for Multi-Granular Explanation of Video Summarization [6.076406622352117]
This framework integrates methods for producing explanations both at the fragment level and at the visual object level. The performance of the developed framework is evaluated using a state-of-the-art summarization method and two datasets.
arXiv Detail & Related papers (2024-05-16T13:25:36Z)
Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We explore a novel discourse-based framework to evaluate the coherence of video paragraphs. Central to our approach is the discourse representation of videos, which helps in modeling coherence of paragraphs conditioned on coherence of videos. Our experimental results show that the proposed framework evaluates coherence of video paragraphs significantly better than all the baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z)
Video Summarization Using Deep Neural Networks: A Survey [72.98424352264904]
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization.
arXiv Detail & Related papers (2021-01-15T11:41:29Z)
Sequential Explanations with Mental Model-Based Policies [20.64968620536829]
We apply a reinforcement learning framework to provide explanations based on the explainee's mental model. We conduct novel online human experiments where explanations are selected and presented to participants. Our results suggest that mental model-based policies may increase interpretability over multiple sequential explanations.
arXiv Detail & Related papers (2020-07-17T14:43:46Z)
Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.