BERTHA: Video Captioning Evaluation Via Transfer-Learned Human
Assessment
- URL: http://arxiv.org/abs/2201.10243v1
- Date: Tue, 25 Jan 2022 11:29:58 GMT
- Title: BERTHA: Video Captioning Evaluation Via Transfer-Learned Human
Assessment
- Authors: Luis Lebron, Yvette Graham, Kevin McGuinness, Konstantinos Kouramas,
Noel E. O'Connor
- Abstract summary: This paper presents a new method based on a deep learning model to evaluate video captioning systems.
The model is based on BERT, which is a language model that has been shown to work well in multiple NLP tasks.
The aim is for the model to learn to perform an evaluation similar to that of a human.
- Score: 16.57721566105298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating video captioning systems is a challenging task as there are
multiple factors to consider; for instance: the fluency of the caption,
multiple actions happening in a single scene, and the human bias of what is
considered important. Most metrics try to measure how similar the
system-generated captions are to a single or a set of human-annotated captions. This
paper presents a new method based on a deep learning model to evaluate these
systems. The model is based on BERT, which is a language model that has been
shown to work well in multiple NLP tasks. The aim is for the model to learn to
perform an evaluation similar to that of a human. To do so, we use a dataset
that contains human evaluations of system-generated captions. The dataset
consists of the human judgments of the captions produced by the systems
participating in various years of the TRECVid video-to-text task. These
annotations will be made publicly available. BERTHA obtains favourable
results, outperforming the commonly used metrics in some setups.
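The abstract describes fine-tuning a BERT-based model on human judgments so that it learns to score system captions the way a human assessor would. As a rough illustration only, here is a minimal sketch of that idea using a Hugging Face BERT encoder with a single-output regression head; the backbone name, the candidate/reference pairing, the example rating, and the score_caption helper are assumptions for illustration, not the authors' actual BERTHA setup.

```python
# Minimal sketch (not the released BERTHA code): fine-tune a BERT model with a
# one-output regression head to predict a human quality judgment for a
# system-generated caption paired with a reference caption. Backbone name,
# pairing scheme, and the example label are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression"
)

# One training step against a human judgment (e.g. a TRECVid-style rating).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(
    ["a man plays a guitar on stage"],   # system-generated caption
    ["a person is playing the guitar"],  # human reference caption
    truncation=True, padding=True, return_tensors="pt",
)
human_score = torch.tensor([0.8])        # illustrative rating in [0, 1]
loss = model(**batch, labels=human_score).loss  # MSE loss for a regression head
loss.backward()
optimizer.step()

def score_caption(candidate: str, reference: str) -> float:
    """Predict a human-like quality score for a candidate caption."""
    model.eval()
    inputs = tokenizer(candidate, reference, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```

In the paper's setup, the training signal comes from the human judgments collected for systems participating in the TRECVid video-to-text task, so at evaluation time the predicted score stands in for a human assessment.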
Related papers
- Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training [44.008094698200026]
PAC-S++ is a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data.
We show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors.
arXiv Detail & Related papers (2024-10-09T18:00:09Z) - BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of video-language models to generate descriptions of real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based
Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - Positive-Augmented Contrastive Learning for Image and Video Captioning
Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S)
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos (see the CLIP-similarity sketch after this list).
arXiv Detail & Related papers (2023-03-21T18:03:14Z) - Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
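Several of the metrics listed above (PAC-S, PAC-S++, BRIDGE) build on CLIP-style image-text embeddings rather than overlap with reference captions. As a point of comparison only, the following is a minimal sketch of a plain CLIP-similarity, reference-free caption score; it does not implement the positive-augmented contrastive training of PAC-S, and the checkpoint name and the clip_style_score helper are assumptions for illustration.

```python
# Minimal sketch of a CLIP-style, reference-free caption score in the spirit of
# the learnable metrics above: cosine similarity between image and caption
# embeddings from a pretrained CLIP model. This is a plain CLIP-similarity
# baseline, not the actual PAC-S recipe (which additionally trains on
# positive-augmented, curated data).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CHECKPOINT = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(CHECKPOINT)
processor = CLIPProcessor.from_pretrained(CHECKPOINT)

def clip_style_score(image: Image.Image, caption: str) -> float:
    """Reference-free score: cosine similarity of image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()

# Example usage (hypothetical file path):
# score = clip_style_score(Image.open("frame.jpg"), "a dog catches a frisbee")
```

For video captioning, such a score is typically applied to sampled frames and averaged, whereas BERTHA instead learns directly from human judgments of the captions themselves.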