Cross-Modal Coherence for Text-to-Image Retrieval
- URL: http://arxiv.org/abs/2109.11047v1
- Date: Wed, 22 Sep 2021 21:31:27 GMT
- Title: Cross-Modal Coherence for Text-to-Image Retrieval
- Authors: Malihe Alikhani, Fangda Han, Hareesh Ravi, Mubbasir Kapadia, Vladimir
Pavlovic, Matthew Stone
- Abstract summary: We train a Cross-Modal Coherence Model for the text-to-image retrieval task.
Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models.
Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
- Score: 35.82045187976062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Common image-text joint understanding techniques presume that images and the
associated text can universally be characterized by a single implicit model.
However, co-occurring images and text can be related in qualitatively different
ways, and explicitly modeling these relations could improve the performance of
current joint understanding models. In this paper, we train a Cross-Modal
Coherence Model for the text-to-image retrieval task. Our analysis shows that
models trained with
image-text coherence relations can retrieve images originally paired with
target text more often than coherence-agnostic models. We also show via human
evaluation that images retrieved by the proposed coherence-aware model are
preferred over a coherence-agnostic baseline by a huge margin. Our findings
provide insights into the ways that different modalities communicate and the
role of coherence relations in capturing commonsense inferences in text and
imagery.
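As a rough illustration of what a coherence-aware retriever could look like, the sketch below assumes a multi-task setup: a shared image-text embedding trained with an in-batch contrastive retrieval loss plus an auxiliary head that classifies the coherence relation of each paired example. The relation inventory, feature dimensions, fusion, and loss weight are illustrative placeholders, not the paper's exact architecture.

```python
# Minimal sketch (not the authors' exact model): dual-encoder retrieval
# with an auxiliary coherence-relation head. Dimensions, the relation
# inventory, and the loss weight alpha are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RELATIONS = 5  # placeholder inventory of image-text coherence relations

class CoherenceAwareRetriever(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        # Auxiliary head: predict the coherence relation of a (text, image) pair
        self.rel_head = nn.Sequential(
            nn.Linear(2 * emb_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, NUM_RELATIONS),
        )

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        sim = txt @ img.t()                    # text-to-image similarity matrix
        fused = torch.cat([txt, img], dim=-1)  # aligned (text, image) pairs only
        return sim, self.rel_head(fused)

def coherence_aware_loss(sim, rel_logits, rel_labels, alpha=0.5, temp=0.07):
    """In-batch contrastive retrieval loss plus the coherence-relation term."""
    targets = torch.arange(sim.size(0), device=sim.device)
    retrieval = F.cross_entropy(sim / temp, targets)  # retrieve the paired image
    coherence = F.cross_entropy(rel_logits, rel_labels)
    return retrieval + alpha * coherence
```

At test time only the similarity matrix is needed for ranking; in this sketch the relation head acts purely as a training-time signal that shapes the embedding to be coherence-aware.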
Related papers
- Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching [7.7559623054251]
Image-text matching (ITM) is a fundamental problem in computer vision.
We propose Hybrid-modal feature Interaction with multiple Relational Enhancements (termed Hire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
arXiv Detail & Related papers (2024-06-05T13:10:55Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on-par or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models (a WordNet sketch follows this list).
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- Dual Relation Alignment for Composed Image Retrieval [24.812654620141778]
We argue for the existence of two types of relations in composed image retrieval.
The explicit relation holds between the combination of the reference image and complementary text on one side and the target image on the other.
We propose a new framework for composed image retrieval, termed dual relation alignment.
arXiv Detail & Related papers (2023-09-05T12:16:14Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Clue: Cross-modal Coherence Modeling for Caption Generation [38.12058832538408]
We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning.
We introduce a new task for learning inferences in imagery and text, and show that these coherence annotations can be exploited to learn relation classifiers as an intermediary step.
The results show a dramatic improvement in the consistency and quality of the generated captions with respect to information needs specified via coherence relations.
arXiv Detail & Related papers (2020-05-02T19:28:52Z)
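For the WordNet-hierarchy entry above, a minimal sketch of the WordNet side of such an evaluation is given below, assuming NLTK's WordNet corpus is available; the prompt template and depth cutoff are illustrative, not the paper's protocol.

```python
# Sketch of the WordNet side of a hypernymy evaluation, assuming NLTK's
# WordNet corpus has been downloaded via nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def hyponym_lemmas(word, max_depth=2):
    """Collect lemma names of hyponyms of `word` up to `max_depth` levels down."""
    lemmas = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        frontier = [(synset, 0)]
        while frontier:
            node, depth = frontier.pop()
            if depth > 0:  # skip the root concept itself
                lemmas.update(l.name().replace("_", " ") for l in node.lemmas())
            if depth < max_depth:
                frontier.extend((h, depth + 1) for h in node.hyponyms())
    return sorted(lemmas)

# A model that understands the "is-a" relation should depict a valid dog
# for every prompt built from a hyponym of "dog".
for name in hyponym_lemmas("dog")[:5]:
    print(f"a photo of a {name}")  # illustrative prompt template
```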
This list is automatically generated from the titles and abstracts of the papers on this site.