Image-text Retrieval via Preserving Main Semantics of Vision
- URL: http://arxiv.org/abs/2304.10254v2
- Date: Fri, 28 Apr 2023 08:09:54 GMT
- Title: Image-text Retrieval via Preserving Main Semantics of Vision
- Authors: Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, Xudong Dai
- Abstract summary: This paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), to help the model focus on an image's main content.
We leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image.
Experiments on two benchmark datasets demonstrate the superior performance of our method.
- Score: 5.376441473801597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval is one of the major tasks of cross-modal retrieval.
Several approaches for this task map images and texts into a common space to
create correspondences between the two modalities. However, due to the content
(semantics) richness of an image, redundant secondary information in an image
may cause false matches. To address this issue, this paper presents a semantic
optimization approach, implemented as a Visual Semantic Loss (VSL), to assist
the model in focusing on an image's main content. This approach is inspired by
how people typically annotate the content of an image by describing its main
content. Thus, we leverage the annotated texts corresponding to an image to
assist the model in capturing the main content of the image, reducing the
negative impact of secondary content. Extensive experiments on two benchmark
datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our
method. The code is available at: https://github.com/ZhangXu0963/VSL.
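The abstract describes VSL only at a high level; the precise formulation is in the paper and the linked repository. Purely as an illustration of the idea of letting an image's multiple annotated captions define its main content, here is a minimal, hypothetical sketch; the consensus weighting, names, and temperature are assumptions, not the authors' code:

```python
# Hypothetical sketch of a "visual semantic" weighting term (not the authors' code).
# Assumption: each image comes with K annotated captions; captions that agree with
# the consensus of all K captions are treated as describing the image's main content.
import torch
import torch.nn.functional as F

def visual_semantic_loss(img_emb, cap_emb, temperature=0.07):
    """img_emb: (B, D) image embeddings; cap_emb: (B, K, D) caption embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)

    # The consensus of the K captions approximates the image's main semantics.
    consensus = F.normalize(cap_emb.mean(dim=1), dim=-1)                      # (B, D)

    # Weight each caption by how well it matches the consensus.
    weights = torch.softmax(
        torch.einsum('bkd,bd->bk', cap_emb, consensus) / temperature, dim=1)  # (B, K)

    # Contrastive matching of images against the weighted "main content" caption.
    main_cap = torch.einsum('bk,bkd->bd', weights, cap_emb)                    # (B, D)
    logits = img_emb @ main_cap.t() / temperature                              # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```

The intuition in this sketch: captions that agree with the consensus of all annotations are treated as describing the main content, so the matching objective is pulled toward them rather than toward secondary details.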
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
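The entry mentions dynamically allocating visual sequence length across multiple high-resolution images but does not give the policy. A plausible sketch, assuming an area-proportional split of a fixed token budget (function name and defaults are hypothetical, not Leopard's actual module):

```python
# Hypothetical sketch of budget-proportional visual-token allocation
# (illustrative only; Leopard's actual policy is not described in this summary).
def allocate_visual_tokens(image_sizes, total_budget=4096, min_tokens=64):
    """image_sizes: list of (height, width); returns a token count per image."""
    areas = [h * w for h, w in image_sizes]
    total_area = sum(areas)
    # Proportional share of the budget, with a floor so no image is starved.
    alloc = [max(min_tokens, int(total_budget * a / total_area)) for a in areas]
    # Trim any overshoot caused by the floor, largest allocations first.
    while sum(alloc) > total_budget:
        i = alloc.index(max(alloc))
        alloc[i] -= 1
    return alloc

print(allocate_visual_tokens([(1024, 768), (512, 512), (2048, 1536)]))
```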
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that the proposed approach achieves SOTA or near-SOTA performance on two benchmark datasets of tweets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Unpaired Image-to-Image Translation via Latent Energy Transport [61.62293304236371]
Image-to-image translation aims to preserve source contents while translating to discriminative target styles between two visual domains.
In this paper, we propose to deploy an energy-based model (EBM) in the latent space of a pretrained autoencoder for this task.
Our model is the first to be applicable to 1024×1024-resolution unpaired image translation.
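The entry places an EBM in a pretrained autoencoder's latent space; EBMs of this kind are commonly sampled with Langevin dynamics on the latent code. A minimal sketch under that assumption (the energy network, step count, and step sizes are illustrative, not the paper's settings):

```python
# Hypothetical sketch: Langevin updates on a latent code under a learned energy
# function, as one might do for latent-space EBM translation (illustrative only).
import torch

def langevin_translate(z, energy_fn, steps=60, step_size=0.01, noise_scale=0.005):
    """z: (B, D) latents from a pretrained encoder; energy_fn: maps (B, D) -> (B,)."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(z).sum()
        grad, = torch.autograd.grad(energy, z)
        with torch.no_grad():
            # Gradient descent on the energy plus small Gaussian noise.
            z = z - step_size * grad + noise_scale * torch.randn_like(z)
        z.requires_grad_(True)
    # Decode with the pretrained decoder to obtain the translated image.
    return z.detach()
```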
arXiv Detail & Related papers (2020-12-01T17:18:58Z)
- Content-based Image Retrieval and the Semantic Gap in the Deep Learning Era [9.59805804476193]
Content-based image retrieval has seen astonishing progress over the past decade, especially for the task of retrieving images of the same object.
This raises the question: Do the recent advances in instance retrieval transfer to more generic image retrieval scenarios?
We first provide a brief overview of the most relevant milestones of instance retrieval. We then apply them to a semantic image retrieval task and find that they perform worse than much less sophisticated, more generic methods.
We conclude that the key problem for the further advancement of semantic image retrieval lies in the lack of a standardized task definition and an appropriate benchmark dataset.
arXiv Detail & Related papers (2020-11-12T17:00:08Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
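The entry names a Graph Convolutional Network over salient objects and scene text. As a generic illustration (not the paper's exact graph construction), a single mean-aggregation GCN layer over concatenated object and text node features might look like this:

```python
# Hypothetical single GCN layer over a graph whose nodes are salient-object and
# scene-text features (illustrative; not the paper's exact formulation).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        """x: (N, in_dim) node features; adj: (N, N) adjacency with self-loops."""
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = adj @ x / deg                      # mean aggregation over neighbors
        return torch.relu(self.linear(h))

# Nodes = detected-object features + scene-text (OCR) features in one graph.
obj, txt = torch.randn(5, 256), torch.randn(3, 256)
x = torch.cat([obj, txt], dim=0)               # (8, 256)
adj = torch.ones(8, 8)                         # fully connected, incl. self-loops
out = GCNLayer(256, 128)(x, adj)               # (8, 128) relation-enhanced features
```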
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.