HL Dataset: Visually-grounded Description of Scenes, Actions and
Rationales
- URL: http://arxiv.org/abs/2302.12189v3
- Date: Mon, 25 Sep 2023 07:37:20 GMT
- Title: HL Dataset: Visually-grounded Description of Scenes, Actions and
Rationales
- Authors: Michele Cafagna, Kees van Deemter, Albert Gatt
- Abstract summary: We present a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions.
We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically.
- Score: 5.010418546872244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current captioning datasets focus on object-centric captions, describing the
visible objects in the image, e.g. "people eating food in a park". Although
these datasets are useful to evaluate the ability of Vision & Language models
to recognize and describe visual content, they do not support controlled experiments in which models are tested or fine-tuned on more high-level captions, which humans find easy and natural to produce. For example, people
often describe images based on the type of scene they depict ('people at a
holiday resort') and the actions they perform ('people having a picnic'). Such
descriptions draw on personal experience and commonsense assumptions. We
present the High-Level Dataset, a dataset extending 14,997 images from the COCO
dataset, aligned with a new set of 134,973 human-annotated (high-level)
captions collected along three axes: scenes, actions, and rationales. We
further extend this dataset with confidence scores collected from an
independent set of readers, as well as a set of narrative captions generated
synthetically, by combining each of the three axes. We describe this dataset
and analyse it extensively. We also present baseline results for the High-Level
Captioning task.
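To make the dataset structure concrete, the sketch below shows a hypothetical per-image record and a naive way of combining the three axes into a narrative-style caption. The field names, example captions, and composition template are assumptions for illustration only; they are not the released schema or the paper's generation procedure. (As a sanity check, 134,973 captions over 14,997 images works out to nine high-level captions per image across the three axes.)

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HLRecord:
    """Hypothetical per-image record for the High-Level Dataset.

    Field names are illustrative assumptions, not the released schema. The
    paper specifies 14,997 COCO images and 134,973 human-annotated captions
    (nine per image) along the scene/action/rationale axes, plus reader
    confidence scores and synthetic narrative captions.
    """
    coco_image_id: int
    captions: Dict[str, List[str]] = field(default_factory=dict)      # axis -> captions
    confidence: Dict[str, List[float]] = field(default_factory=dict)  # axis -> reader scores


def compose_narrative(scene: str, action: str, rationale: str) -> str:
    """Naive template fusing the three axes into one narrative-style caption.

    This only sketches the idea of 'combining each of the three axes'; it is
    not the procedure used to generate the dataset's synthetic narratives.
    """
    return f"{action.rstrip('.')} at {scene.rstrip('.')}, because {rationale.rstrip('.')}."


if __name__ == "__main__":
    record = HLRecord(
        coco_image_id=318556,  # arbitrary COCO image id, for illustration only
        captions={
            "scene": ["a holiday resort"],
            "action": ["people having a picnic"],
            "rationale": ["they want to relax outdoors"],
        },
        confidence={"scene": [0.9], "action": [0.8], "rationale": [0.7]},
    )
    print(compose_narrative(
        record.captions["scene"][0],
        record.captions["action"][0],
        record.captions["rationale"][0],
    ))
    # -> "people having a picnic at a holiday resort, because they want to relax outdoors."
```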
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions [3.7957452405531256]
This paper explores the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level.
We show that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene.
We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
arXiv Detail & Related papers (2022-11-09T15:33:51Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- RedCaps: web-curated image-text data created by the people, for the people [12.58157541985447]
We introduce RedCaps -- a large-scale dataset of 12M image-text pairs collected from Reddit.
Images and captions from Reddit depict and describe a wide variety of objects and scenes.
We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.
arXiv Detail & Related papers (2021-11-22T18:59:34Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.