Models See Hallucinations: Evaluating the Factuality in Video Captioning
- URL: http://arxiv.org/abs/2303.02961v1
- Date: Mon, 6 Mar 2023 08:32:50 GMT
- Title: Models See Hallucinations: Evaluating the Factuality in Video Captioning
- Authors: Hui Liu, Xiaojun Wan
- Abstract summary: We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
- Score: 57.85548187177109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning aims to describe events in a video with natural language. In
recent years, many works have focused on improving captioning models'
performance. However, like other text generation tasks, video captioning risks introducing
factual errors not supported by the input video. These factual errors can
seriously affect the quality of the generated text, sometimes making it
completely unusable. Although factual consistency has received much research
attention in text-to-text tasks (e.g., summarization), it is less studied in
the context of vision-based text generation. In this work, we conduct a
detailed human evaluation of the factuality in video captioning and collect two
annotated factuality datasets. We find that 57.0% of the model-generated
sentences have factual errors, indicating it is a severe problem in this field.
However, existing evaluation metrics are mainly based on n-gram matching and
show little correlation with human factuality annotation. We further propose a
weakly-supervised, model-based factuality metric FactVC, which outperforms
previous metrics on factuality evaluation of video captioning. The datasets and
metrics will be released to promote future research for video captioning.
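The claim that n-gram metrics correlate poorly with human factuality judgments is typically quantified with Pearson or Spearman correlation between metric scores and human labels. Below is a minimal illustrative sketch of that comparison; the score and label arrays are hypothetical placeholders, and this is not the paper's released evaluation code.

```python
# Illustrative sketch: correlating automatic caption scores with human
# factuality annotations. All numbers are hypothetical placeholders,
# not data from the paper.
from scipy.stats import pearsonr, spearmanr

# One entry per model-generated caption.
metric_scores = [0.41, 0.35, 0.52, 0.28, 0.61, 0.47]  # e.g., an n-gram-based metric
human_factuality = [1, 0, 1, 0, 0, 1]                  # 1 = factual, 0 = contains a factual error

pearson_r, pearson_p = pearsonr(metric_scores, human_factuality)
spearman_r, spearman_p = spearmanr(metric_scores, human_factuality)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3f})")
```

A low correlation in such a comparison indicates that the metric's ranking of captions tells us little about which captions are actually factual.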
Related papers
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), along with a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
arXiv Detail & Related papers (2024-07-18T01:55:48Z) - NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative [19.79736018383692]
Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative.
We propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders for capturing cause and effect dynamics independently.
arXiv Detail & Related papers (2024-06-10T17:34:24Z) - Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z) - Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning [90.13978453378768]
We introduce a comprehensive typology of factual errors in generated chart captions.
A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models.
Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies.
arXiv Detail & Related papers (2023-12-15T19:16:21Z) - Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z) - EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching [90.98122161162644]
Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions.
We propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning.
We exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore.
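As a minimal, hedged sketch of the coarse-grained embedding-matching idea (assuming a CLIP-style model loaded via Hugging Face transformers; the checkpoint name, frame sampling, and mean pooling below are illustrative choices, not the paper's implementation):

```python
# Hedged sketch: coarse-grained video-caption matching with a CLIP-style model.
# This is NOT the official EMScore implementation; the exact model and the
# fine-grained (token/frame-level) matching are omitted or simplified here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def coarse_video_caption_score(frame_paths, caption):
    """Average cosine similarity between sampled frame embeddings and the caption embedding."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    frame_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    # Mean-pool frame embeddings into a coarse video embedding, then compare to the caption.
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    return float((video_emb @ text_emb.T).item())

# Hypothetical usage with placeholder frame paths:
# score = coarse_video_caption_score(["frame_000.jpg", "frame_010.jpg"], "a man is cooking pasta")
```

Fine-grained matching, which compares individual caption tokens against individual frames, is left out of this sketch for brevity.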
arXiv Detail & Related papers (2021-11-17T06:02:43Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.