Is An Image Worth Five Sentences? A New Look into Semantics for
Image-Text Matching
- URL: http://arxiv.org/abs/2110.02623v1
- Date: Wed, 6 Oct 2021 09:54:28 GMT
- Title: Is An Image Worth Five Sentences? A New Look into Semantics for
Image-Text Matching
- Authors: Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas
- Abstract summary: We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
- Score: 10.992151305603267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of image-text matching aims to map representations from different
modalities into a common joint visual-textual embedding. However, the most
widely used datasets for this task, MSCOCO and Flickr30K, are actually image
captioning datasets that offer a very limited set of relationships between
images and sentences in their ground-truth annotations. This limited ground
truth information forces us to use evaluation metrics based on binary
relevance: given a sentence query we consider only one image as relevant.
However, many other relevant images or captions may be present in the dataset.
In this work, we propose two metrics that evaluate the degree of semantic
relevance of retrieved items, independently of their annotated binary
relevance. Additionally, we incorporate a novel strategy that uses an image
captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be
optimized in a standard triplet loss. By incorporating our formulation into
existing models, a large improvement is obtained in scenarios where
available training data is limited. We also demonstrate that the performance on
the annotated image-caption pairs is maintained while improving on other
non-annotated relevant items when employing the full training set. Code with
our metrics and adaptive margin formulation will be made public.
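As a hedged illustration of the Semantic Adaptive Margin idea described above, the sketch below derives a per-triplet margin from a CIDEr-style similarity between the positive and negative captions and plugs it into a standard hinge triplet loss. The function names, the linear scaling, and the default constants are assumptions for illustration, not the authors' implementation.

```python
def semantic_adaptive_margin(cider_pos_neg, base_margin=0.2, scale=0.5):
    """Hypothetical SAM: the higher the CIDEr similarity between the
    positive and the negative caption, the smaller the margin, so
    semantically close negatives are not pushed as far away."""
    return max(0.0, base_margin - scale * cider_pos_neg)

def triplet_loss_with_sam(s_pos, s_neg, cider_pos_neg, base_margin=0.2):
    # Standard hinge triplet loss, but with a per-triplet margin derived
    # from the captioning metric instead of a single fixed constant.
    margin = semantic_adaptive_margin(cider_pos_neg, base_margin)
    return max(0.0, margin + s_neg - s_pos)
```

Under this sketch, a semantically unrelated negative (CIDEr near 0) keeps the full margin, while a near-duplicate caption (high CIDEr) is penalized less.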
Related papers
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information in unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs are highly inconsistent, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
arXiv Detail & Related papers (2022-08-22T07:58:54Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends it to images with novel objects and without paired captions in the training data.
arXiv Detail & Related papers (2020-11-02T13:33:20Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
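The hinge-based triplet ranking objective mentioned in the last item can be sketched as a bidirectional sum over negatives, as is common for image-text matching; this is a generic formulation under assumed names, not that paper's exact objective.

```python
def bidirectional_hinge_loss(sim, margin=0.2):
    """Generic sum-over-negatives triplet ranking loss.

    sim: n x n list of lists of similarities, where sim[i][j] is the
    similarity between image i and caption j; sim[i][i] is the
    annotated (positive) pair.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # image i should rank its own caption above caption j
            loss += max(0.0, margin + sim[i][j] - sim[i][i])
            # caption i should rank its own image above image j
            loss += max(0.0, margin + sim[j][i] - sim[i][i])
    return loss
```

With a fixed `margin`, every negative is treated identically regardless of its semantic closeness; the Semantic Adaptive Margin proposed in the main paper above replaces this constant with a CIDEr-informed value.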
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.