Image Retrieval from Contextual Descriptions
- URL: http://arxiv.org/abs/2203.15867v1
- Date: Tue, 29 Mar 2022 19:18:12 GMT
- Title: Image Retrieval from Contextual Descriptions
- Authors: Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo
Ponti, Siva Reddy
- Abstract summary: Image Retrieval from Contextual Descriptions (ImageCoDe)
tasks models with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
- Score: 22.084939474881796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to integrate context, including perceptual and temporal cues,
plays a pivotal role in grounding the meaning of a linguistic utterance. In
order to measure to what extent current vision-and-language models master this
ability, we devise a new multimodal challenge, Image Retrieval from Contextual
Descriptions (ImageCoDe). In particular, models are tasked with retrieving the
correct image from a set of 10 minimally contrastive candidates based on a
contextual description. As such, each description contains only the details
that help distinguish between images. Because of this, descriptions tend to be
complex in terms of syntax and discourse and require drawing pragmatic
inferences. Images are sourced from both static pictures and video frames. We
benchmark several state-of-the-art models, including both cross-encoders such
as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that
these models dramatically lag behind human performance: the best variant
achieves an accuracy of 20.9 on video frames and 59.4 on static pictures,
compared with 90.8 in humans. Furthermore, we experiment with new model
variants that are better equipped to incorporate visual and temporal context
into their representations, which achieve modest gains. Our hope is that
ImageCoDe will foster progress in grounded language understanding by
encouraging models to focus on fine-grained visual differences.
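As a rough illustration of the retrieval setup, the snippet below scores one contextual description against 10 candidate images with an off-the-shelf CLIP bi-encoder and picks the best match. This is a minimal sketch only: the generic openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers, the example description, and the image paths are all assumptions, not the paper's fine-tuned variants or evaluation code.

```python
# Sketch: CLIP bi-encoder retrieval over 10 minimally contrastive candidates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical example: one contextual description, 10 candidate image files.
description = "The ball has just left his foot and hovers slightly above the grass."
candidate_paths = [f"candidates/img_{i}.jpg" for i in range(10)]
images = [Image.open(p) for p in candidate_paths]

inputs = processor(text=[description], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, 10): similarity of the description to each candidate.
scores = outputs.logits_per_text.squeeze(0)
predicted_index = scores.argmax().item()
print(f"Predicted candidate: {predicted_index} (score {scores[predicted_index]:.2f})")
```

Benchmark accuracy is then the fraction of examples where the predicted index matches the gold candidate; this is the setting in which the best model variants reach 20.9 (video frames) and 59.4 (static pictures) against 90.8 for humans.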
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
arXiv Detail & Related papers (2023-04-21T03:45:59Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.