Consensus-Aware Visual-Semantic Embedding for Image-Text Matching
- URL: http://arxiv.org/abs/2007.08883v2
- Date: Mon, 1 Feb 2021 12:35:10 GMT
- Title: Consensus-Aware Visual-Semantic Embedding for Image-Text Matching
- Authors: Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma
- Abstract summary: Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
- Score: 69.34076386926984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn
their representations, thereby exploiting their matching relationships and
making the corresponding alignments. Such approaches only exploit the
superficial associations contained in the instance pairwise data, with no
consideration of any external commonsense knowledge, which may hinder their
capability to reason about the higher-level relationships between image and text.
In this paper, we propose a Consensus-aware Visual-Semantic Embedding (CVSE)
model to incorporate the consensus information, namely the commonsense
knowledge shared between both modalities, into image-text matching.
Specifically, the consensus information is exploited by computing the
statistical co-occurrence correlations between the semantic concepts from the
image captioning corpus and deploying the constructed concept correlation graph
to yield the consensus-aware concept (CAC) representations. Afterwards, CVSE
learns the associations and alignments between image and text based on the
exploited consensus as well as the instance-level representations for both
modalities. Extensive experiments conducted on two public datasets verify that
the exploited consensus makes significant contributions to constructing more
meaningful visual-semantic embeddings, yielding superior performance over
state-of-the-art approaches on the bidirectional image and text retrieval task.
Our code is available at: https://github.com/BruceW91/CVSE.
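The abstract describes the consensus mechanism only at a high level. The snippet below is a minimal sketch of that idea under stated assumptions: a hand-picked toy concept list and corpus, random vectors standing in for pretrained word embeddings, and a plain weighted graph-propagation step standing in for the paper's graph convolution and confidence scaling. It is not the released CVSE implementation (see the repository above for that).

```python
# Minimal sketch (not the authors' code): build a concept co-occurrence
# correlation graph from a captioning corpus and propagate concept
# embeddings over it to obtain consensus-aware concept (CAC) vectors.
import numpy as np

def build_correlation_graph(captions, concepts):
    """Count how often concept pairs co-occur in the same caption and
    row-normalize the counts into a correlation (adjacency) matrix."""
    index = {c: i for i, c in enumerate(concepts)}
    q = len(concepts)
    cooc = np.zeros((q, q))
    for caption in captions:
        present = {index[w] for w in caption.lower().split() if w in index}
        for i in present:
            for j in present:
                if i != j:
                    cooc[i, j] += 1
    # Row-normalize so each row sums to 1 (a simple stand-in for the
    # paper's scaling of the correlation matrix).
    row_sums = cooc.sum(axis=1, keepdims=True)
    return np.divide(cooc, row_sums, out=np.zeros_like(cooc), where=row_sums > 0)

def consensus_aware_concepts(word_embeddings, corr, steps=2, alpha=0.5):
    """Mix each concept embedding with a weighted average of its
    correlated neighbours' embeddings (toy graph propagation)."""
    cac = word_embeddings.copy()
    for _ in range(steps):
        cac = (1 - alpha) * cac + alpha * corr @ cac
    return cac

if __name__ == "__main__":
    concepts = ["dog", "frisbee", "grass", "man"]
    captions = ["a dog catches a frisbee on the grass",
                "a man throws a frisbee to his dog"]
    corr = build_correlation_graph(captions, concepts)
    emb = np.random.default_rng(0).normal(size=(len(concepts), 8))  # toy stand-in for GloVe
    cac = consensus_aware_concepts(emb, corr)
    print(cac.shape)  # (4, 8): one consensus-aware vector per concept
```

In the paper, representations like these consensus-aware concept vectors are combined with instance-level image and text features before matching; here only the co-occurrence and propagation steps are illustrated.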
Related papers
- Dual Relation Alignment for Composed Image Retrieval [24.812654620141778]
We argue for the existence of two types of relations in composed image retrieval.
The explicit relation pertains to the pairing of the reference image and complementary text with the target image.
We propose a new framework for composed image retrieval, termed dual relation alignment.
arXiv Detail & Related papers (2023-09-05T12:16:14Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions (a toy version of such a consensus term is sketched after this list).
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements, achieving significant recall gains of 2.77% in R@10 and 6.67% in R@50.
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
- Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design intra- and inter-modal reasoning over spatial and semantic graphs to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representations via cross-level object-sentence and word-image interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z)
- CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
A two-stream architecture is developed to decompose image-text matching into image-to-text and text-to-image similarity computation.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
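As referenced in the Collaborative Group (Css-Net) entry above, a consensus objective can be expressed as a divergence between compositors' score distributions. The snippet below is a hypothetical toy version under assumptions not taken from that paper: two compositors instead of four, raw similarity scores softmax-normalized over candidates, and a symmetric KL term; function and variable names are illustrative only.

```python
# Hypothetical sketch of a KL-divergence consensus term between two
# compositors' similarity distributions (not the Css-Net code).
import torch
import torch.nn.functional as F

def kl_consensus_loss(scores_a, scores_b, temperature=0.07):
    """scores_a, scores_b: (batch, num_candidates) similarity scores from
    two different compositors for the same queries. Each set of scores is
    turned into a distribution over candidates; the symmetric KL divergence
    pulls the two distributions toward agreement."""
    log_p = F.log_softmax(scores_a / temperature, dim=1)
    log_q = F.log_softmax(scores_b / temperature, dim=1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)

# Example: two compositors scoring 128 candidates for a batch of 32 queries.
a = torch.randn(32, 128)
b = torch.randn(32, 128)
print(kl_consensus_loss(a, b))
```

In practice such a term would be added to the retrieval loss with a weighting coefficient; the temperature and pairing of compositors here are arbitrary choices for illustration.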
This list is automatically generated from the titles and abstracts of the papers in this site.