Related papers: Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval

URL: http://arxiv.org/abs/2105.07391v1
Date: Sun, 16 May 2021 09:43:25 GMT
Title: Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
Authors: Kazuya Ueki
Abstract summary: We focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area. We provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching. A description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented.
Score: 0.6091702876917279
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual-semantic embedding is an interesting research topic because it is useful for various tasks, such as visual question answering (VQA), image-text retrieval, image captioning, and scene graph generation. In this paper, we focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area. First, we provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching and how the technology has evolved over time. In addition, a description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented. We also introduce the implementation available on GitHub for use in confirming the accuracy of experiments and for further improvement. We hope that this survey paper will encourage researchers to further develop their research on bridging images and languages.
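
As a rough illustration of the retrieval setting the abstract describes, the sketch below ranks images against a sentence query by cosine similarity in a shared embedding space. It is not taken from the survey: the function names (cosine, rank_images), the file names, and the three-dimensional vectors are hypothetical, and a real system would obtain the vectors from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(query_vec, image_vecs):
    """Return image ids sorted by decreasing similarity to the query."""
    scores = {img_id: cosine(query_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, hand-made embeddings standing in for encoder outputs.
image_vecs = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of a sentence query such as "a dog in a park".
query_vec = [0.8, 0.2, 0.1]

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

In the zero-shot setting, nothing in the ranking step depends on the query or the images having been seen during training; only the encoders that produce the vectors are learned.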

Related papers

Text-to-Image Cross-Modal Generation: A Systematic Review [0.0]
We review research on generating visual data from text from the angle of "cross-modal generation". We provide a breakdown of text-to-image generation into various flavors of image-from-text methods, video-from-text methods, image editing, self-supervised and graph-based approaches.
arXiv Detail & Related papers (2024-01-21T23:54:05Z)
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Most existing VG datasets are constructed using simple description texts. We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval. We first examine the related concerns and why the focus is on image-text retrieval tasks. We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)
Deep Learning Approaches on Image Captioning: A Review [0.5852077003870417]
Image captioning aims to generate natural language descriptions for visual content in the form of still images. Deep learning and vision-language pre-training techniques have revolutionized the field, leading to more sophisticated methods and improved performance. We address the challenges faced in this field by emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions. We identify several potential future directions for research in this area, which include tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately
arXiv Detail & Related papers (2022-01-31T00:39:37Z)
Deep Image Deblurring: A Survey [165.32391279761006]
Deblurring is a classic problem in low-level computer vision, which aims to recover a sharp image from a blurred input image. Recent advances in deep learning have led to significant progress in solving this problem.
arXiv Detail & Related papers (2022-01-26T01:31:30Z)
A Thorough Review on Recent Deep Learning Methodologies for Image Captioning [0.0]
It is becoming increasingly difficult to keep up with the latest research and findings in the field of image captioning. This review paper serves as a roadmap for researchers to keep up to date with the latest contributions made in the field of image caption generation.
arXiv Detail & Related papers (2021-07-28T00:54:59Z)
From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence. Research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the desired content is located. In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where"). Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
A Decade Survey of Content Based Image Retrieval using Deep Learning [13.778851745408133]
This paper presents a comprehensive survey of deep learning based developments in the past decade for content based image retrieval. The similarity between the representative features of the query image and dataset images is used to rank the images for retrieval. Over the past decade, deep learning has emerged as a dominant alternative to hand-designed feature engineering.
arXiv Detail & Related papers (2020-11-23T02:12:30Z)
Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph. We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z)
