GROOViST: A Metric for Grounding Objects in Visual Storytelling
- URL: http://arxiv.org/abs/2310.17770v1
- Date: Thu, 26 Oct 2023 20:27:16 GMT
- Title: GROOViST: A Metric for Grounding Objects in Visual Storytelling
- Authors: Aditya K Surikuchi, Sandro Pezzelle, Raquel Fernández
- Abstract summary: We focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images.
We propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments, and human intuitions on visual grounding.
- Score: 3.650221968508535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A proper evaluation of stories generated for a sequence of images -- the task
commonly referred to as visual storytelling -- must consider multiple aspects,
such as coherence, grammatical correctness, and visual grounding. In this work,
we focus on evaluating the degree of grounding, that is, the extent to which a
story is about the entities shown in the images. We analyze current metrics,
both designed for this purpose and for general vision-text alignment. Given
their observed shortcomings, we propose a novel evaluation tool, GROOViST, that
accounts for cross-modal dependencies, temporal misalignments (the fact that
the order in which entities appear in the story and the image sequence may not
match), and human intuitions on visual grounding. An additional advantage of
GROOViST is its modular design, where the contribution of each component can be
assessed and interpreted individually.
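For intuition only, the sketch below shows one minimal way a grounding score of this kind could be computed: noun phrases from the story are matched against object regions pooled over the entire image sequence, so a match is allowed even when the mention and the object occur at different positions (a crude nod to temporal misalignment). This is not GROOViST's actual formulation; the encoder inputs, the similarity threshold, and all names are illustrative assumptions.

```python
# Illustrative sketch only: a minimal grounding score for a story given an
# image sequence, assuming noun-phrase and object-region embeddings were
# already computed (e.g. with a CLIP-style encoder). NOT the actual GROOViST
# metric; the 0.25 threshold and all names are placeholders.
import numpy as np

def grounding_score(np_embs: np.ndarray, region_embs: np.ndarray,
                    threshold: float = 0.25) -> float:
    """np_embs: (N, d) embeddings of noun phrases extracted from the story.
    region_embs: (M, d) embeddings of object regions pooled over ALL images,
    so matches are allowed wherever in the sequence the object appears."""
    # Cosine similarity between every noun phrase and every region.
    a = np_embs / np.linalg.norm(np_embs, axis=1, keepdims=True)
    b = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = a @ b.T                      # (N, M)
    best = sims.max(axis=1)             # best-matching region per noun phrase
    # A noun phrase counts as grounded if its best match clears the threshold;
    # the story-level score is the mean over noun phrases.
    return float((best > threshold).mean())

# Toy usage with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
print(grounding_score(rng.normal(size=(5, 512)), rng.normal(size=(12, 512))))
```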
Related papers
- Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that take as input a single image of an individual and ground the generation process on that image together with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and comes with rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and that sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
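As a hedged illustration of the visual-prefix idea described above (not the authors' actual architecture), the sketch below projects image features into a fixed number of continuous embeddings that are prepended to the token embeddings fed to a language model; the dimensions, prefix length, and module names are assumptions.

```python
# Minimal sketch of a "visual prefix" in the spirit of prefix-style visual
# conditioning; NOT the paper's architecture. Feature sizes, prefix length,
# and module names are assumptions chosen for illustration.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, img_dim: int = 512, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Project each image feature vector to `prefix_len` LM-sized embeddings.
        self.proj = nn.Linear(img_dim, prefix_len * lm_dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, num_images, img_dim)
        # returns:   (batch, num_images * prefix_len, lm_dim)
        b, n, _ = img_feats.shape
        return self.proj(img_feats).view(b, n * self.prefix_len, self.lm_dim)

# Usage: concatenate the prefix with the story token embeddings before the LM.
prefix = VisualPrefix()(torch.randn(2, 5, 512))    # a 5-image sequence
token_embs = torch.randn(2, 30, 768)               # embedded story tokens
lm_input = torch.cat([prefix, token_embs], dim=1)  # fed to the language model
```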
arXiv Detail & Related papers (2023-10-08T21:45:34Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language
Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - Quantitative analysis of visual representation of sign elements in
COVID-19 context [2.9409535911474967]
We propose using computational methods to perform a quantitative analysis of the elements used in the visual creations produced in reference to the epidemic.
We analyze the images compiled in The Covid Art Museum's Instagram account to identify the different elements used to represent subjective experiences of a global event.
This research reveals the elements that are repeated across images to create narratives, as well as the relations of association established in the sample.
arXiv Detail & Related papers (2021-12-15T15:54:53Z) - Consensus Graph Representation Learning for Better Grounded Image
Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
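For context, CHAIRi measures object hallucination as the fraction of object mentions in a caption that do not appear in the image's ground-truth object annotations. A minimal sketch of that computation is below; real implementations map mentions to canonical object categories (handling synonyms and plurals), which the exact-string matching here deliberately skips.

```python
# Minimal sketch of the CHAIRi object-hallucination metric: the fraction of
# object mentions in a caption that are absent from the image's ground-truth
# object set. Exact-string matching is a simplification for illustration.
def chair_i(mentioned_objects: list[str], gt_objects: set[str]) -> float:
    if not mentioned_objects:
        return 0.0
    hallucinated = [obj for obj in mentioned_objects if obj not in gt_objects]
    return len(hallucinated) / len(mentioned_objects)

# A caption mentioning "dog" for an image annotated only with {person, frisbee}
# gets CHAIRi = 1/3 (one hallucinated mention out of three).
print(chair_i(["person", "dog", "frisbee"], {"person", "frisbee"}))
```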
arXiv Detail & Related papers (2021-12-02T04:17:01Z) - Integrating Visuospatial, Linguistic and Commonsense Structure into
Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z) - Towards Coherent Visual Storytelling with Ordered Image Attention [73.422281039592]
We develop Ordered Image Attention (OIA) and Image-Sentence Attention (ISA).
OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence.
To generate the story's sentences, we then highlight important image attention vectors with ISA.
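As a generic illustration of sentence-conditioned attention over image regions (not the paper's specific OIA/ISA formulation), the sketch below applies scaled dot-product attention in which a sentence representation attends over region features drawn from every image in the sequence; all shapes and names are assumptions.

```python
# Generic scaled dot-product attention of a sentence query over image-region
# features; a rough illustration only, not the paper's OIA/ISA design.
import numpy as np

def attend(query: np.ndarray, regions: np.ndarray) -> np.ndarray:
    """query: (d,) sentence representation; regions: (R, d) region features
    pooled from every image in the sequence. Returns a (d,) attended vector."""
    scores = regions @ query / np.sqrt(query.shape[0])  # (R,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over regions
    return weights @ regions                            # weighted sum of regions

rng = np.random.default_rng(1)
context = attend(rng.normal(size=256), rng.normal(size=(40, 256)))
```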
arXiv Detail & Related papers (2021-08-04T17:12:39Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)