Image Retrieval from Contextual Descriptions
- URL: http://arxiv.org/abs/2203.15867v1
- Date: Tue, 29 Mar 2022 19:18:12 GMT
- Title: Image Retrieval from Contextual Descriptions
- Authors: Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo
Ponti, Siva Reddy
- Abstract summary: Image Retrieval from Contextual Descriptions (ImageCoDe)
tasks models with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
- Score: 22.084939474881796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to integrate context, including perceptual and temporal cues,
plays a pivotal role in grounding the meaning of a linguistic utterance. In
order to measure to what extent current vision-and-language models master this
ability, we devise a new multimodal challenge, Image Retrieval from Contextual
Descriptions (ImageCoDe). In particular, models are tasked with retrieving the
correct image from a set of 10 minimally contrastive candidates based on a
contextual description. As such, each description contains only the details
that help distinguish between images. Because of this, descriptions tend to be
complex in terms of syntax and discourse and require drawing pragmatic
inferences. Images are sourced from both static pictures and video frames. We
benchmark several state-of-the-art models, including both cross-encoders such
as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that
these models dramatically lag behind human performance: the best variant
achieves an accuracy of 20.9 on video frames and 59.4 on static pictures,
compared with 90.8 in humans. Furthermore, we experiment with new model
variants that are better equipped to incorporate visual and temporal context
into their representations, which achieve modest gains. Our hope is that
ImageCoDe will foster progress in grounded language understanding by
encouraging models to focus on fine-grained visual differences.
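The bi-encoder setup described above (e.g. CLIP) reduces to embedding the description and each of the 10 candidate images, then picking the candidate with the highest similarity. The following is a minimal sketch of that scoring step; the 512-dimensional random vectors stand in for real CLIP embeddings, and `retrieve` is an illustrative helper, not part of the paper's released code.

```python
import numpy as np

def retrieve(description_emb, candidate_embs):
    """Return the index of the candidate image whose embedding has the
    highest cosine similarity with the description embedding."""
    d = description_emb / np.linalg.norm(description_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ d  # one similarity score per candidate
    return int(np.argmax(scores))

# Toy example: 10 random "image embeddings"; the description embedding is
# constructed to lie closest to candidate 3.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(10, 512))
description = candidates[3] + 0.01 * rng.normal(size=512)
print(retrieve(description, candidates))  # prints 3
```

In the actual benchmark the hard part is not this argmax but producing embeddings fine-grained enough to separate minimally contrastive candidates, which is where the reported models fall short of humans.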
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- A Picture May Be Worth a Hundred Words for Visual Question Answering [26.83504716672634]
In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are widely used across multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model.
arXiv Detail & Related papers (2021-06-25T06:13:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.