Related papers: NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

URL: http://arxiv.org/abs/2406.06499v3
Date: Sat, 15 Feb 2025 10:01:13 GMT
Title: NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Authors: Asmar Nadeem, Faegheh Sardari, Robert Dawes, Syed Sameed Husain, Adrian Hilton, Armin Mustafa,
Abstract summary: Existing video captioning benchmarks and models lack causal-temporal narrative.<n>This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content.<n>We propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics.
Score: 19.79736018383692
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Existing video captioning benchmarks and models lack causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.

Related papers

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control [59.6128550986024]
NarraScore is a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic.<n>NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism.<n>NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead.
arXiv Detail & Related papers (2026-02-09T09:39:42Z)
HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression [7.305586811678626]
We introduce the E-commerce Hierarchical Video Captioning dataset with dual-granularity, temporally grounded annotations.<n>We adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions.<n>We propose the Scene-Primed ASR-anchored Caption (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues.
arXiv Detail & Related papers (2026-01-12T09:41:31Z)
CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order.<n>We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z)
NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models [8.6767620170781]
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering.<n>Many models and training methods explicitly encourage continuity across events to enhance narrative coherence.<n>We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context.
arXiv Detail & Related papers (2025-11-09T17:41:11Z)
Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models [10.585096070697348]
We introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions.<n>VideoNarrator addresses challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models can function as caption generators.<n>Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations.
arXiv Detail & Related papers (2025-07-22T22:16:37Z)
ImplicitQA: Going beyond frames towards Implicit Video Reasoning [36.65883181090953]
ImplicitQA is a novel benchmark designed to test models on implicit reasoning.<n>It comprises 1K meticulously annotated QA pairs derived from 320+ high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions [3.9633773442108873]
We propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity.
arXiv Detail & Related papers (2025-03-07T07:15:06Z)
Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics (CDE) and the Temporal Learner (TRL) We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z)
Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with multi-modal contrastive loss. Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships. We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains. LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [47.88160253507823]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism. CTGM incorporates the Temporal Information (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention.
arXiv Detail & Related papers (2024-08-15T14:47:44Z)
Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models [12.907590808274358]
We propose a novel Rich-contextual Diffusion Models (RCDMs) to enhance story generation's semantic consistency and temporal consistency. RCDMs can generate consistent stories with a single forward inference compared to autoregressive models.
arXiv Detail & Related papers (2024-07-02T17:58:07Z)
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models [6.668241588219693]
Visual storytelling is increasingly desired beyond real-world imagery. Current techniques, which typically use autoregressive decoders, suffer from low inference speed and are not well-suited for synthetic scenes. We propose a novel diffusion-based system DiffuVST, which models a series of visual descriptions as a single conditional denoising process.
arXiv Detail & Related papers (2023-12-12T08:40:38Z)
Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions. We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field. We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We are exploring a novel discourse based framework to evaluate the coherence of video paragraphs. Central to our approach is the discourse representation of videos, which helps in modeling coherence of paragraphs conditioned on coherence of videos. Our experiment results have shown that the proposed framework evaluates coherence of video paragraphs significantly better than all the baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.