FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
- URL: http://arxiv.org/abs/2507.06523v1
- Date: Wed, 09 Jul 2025 03:51:27 GMT
- Title: FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
- Authors: Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du
- Abstract summary: VideoMLLMs have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task and also fail to assess hallucinations in open-ended, free-form responses. We propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts. We also introduce Post-Correction, a tool-based correction framework that revises hallucinated content.
- Score: 30.111545374280194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
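The abstract outlines FIFA as a pipeline: extract atomic descriptive facts from the generated text (or from the prompt, in the Text-to-Video direction), organize them in a Spatio-Temporal Semantic Dependency Graph, and verify each fact against the video with a VideoQA model. The sketch below is a minimal, hypothetical rendering of that verification loop in Python; the `Fact` fields, the `videoqa_yes_no` callable, and the scoring rule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal FIFA-style faithfulness sketch (illustrative only).
# `videoqa_yes_no` is a hypothetical stand-in for a VideoQA model;
# fact extraction and dependency edges are assumed to be produced upstream
# (e.g., by an LLM), which is not shown here.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

import networkx as nx


@dataclass
class Fact:
    """One atomic descriptive fact, e.g. 'a dog is running on the beach'."""
    fact_id: int
    text: str          # natural-language statement of the fact
    fact_type: str     # e.g. "object", "attribute", "action", "relation"


def build_dependency_graph(facts: List[Fact],
                           edges: List[Tuple[int, int]]) -> nx.DiGraph:
    """Build a spatio-temporal semantic dependency graph.

    An edge (u, v) means fact v only makes sense if fact u holds
    (e.g. an action fact depends on the object fact it describes).
    Edges are assumed to connect only the fact_ids listed in `facts`.
    """
    graph = nx.DiGraph()
    for fact in facts:
        graph.add_node(fact.fact_id, fact=fact)
    graph.add_edges_from(edges)
    return graph


def fifa_score(video_path: str,
               facts: List[Fact],
               edges: List[Tuple[int, int]],
               videoqa_yes_no: Callable[[str, str], bool]) -> Dict[str, object]:
    """Verify facts in dependency order and return a faithfulness score.

    `videoqa_yes_no(video_path, question)` is assumed to return True/False;
    a fact whose parent fact failed verification is counted as unfaithful.
    """
    graph = build_dependency_graph(facts, edges)
    verified: Dict[int, bool] = {}

    for fact_id in nx.topological_sort(graph):
        fact: Fact = graph.nodes[fact_id]["fact"]
        parents_ok = all(verified.get(p, False) for p in graph.predecessors(fact_id))
        if not parents_ok:
            verified[fact_id] = False
            continue
        question = f"Is the following true in the video? {fact.text}"
        verified[fact_id] = videoqa_yes_no(video_path, question)

    score = sum(verified.values()) / max(len(verified), 1)
    hallucinated = [graph.nodes[i]["fact"].text for i, ok in verified.items() if not ok]
    return {"faithfulness": score, "hallucinated_facts": hallucinated}
```

Per the abstract, Post-Correction would then take the facts flagged as hallucinated and revise them with external tools; the returned dictionary above is where such a correction step could plug in.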
Related papers
- ARGUS: Hallucination and Omission Evaluation in Video-LLMs [86.73977434293973]
ARGUS is a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground-truth captions, ARGUS quantifies the dual metrics of hallucination and omission.
arXiv Detail & Related papers (2025-06-09T02:42:13Z)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
- VTD-CLIP: Video-to-Text Discretization via Prompting CLIP [44.51452778561945]
Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches rely primarily on parameter-efficient fine-tuning of image-text pre-trained models. We propose a video-to-text discretization framework to address limited interpretability and poor generalization due to inadequate temporal modeling.
arXiv Detail & Related papers (2025-03-24T07:27:19Z)
- Expertized Caption Auto-Enhancement for Video-Text Retrieval [10.250004732070494]
This paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability. Our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
arXiv Detail & Related papers (2025-02-05T04:51:46Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating that factual errors are a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)