End-to-end Semantic Object Detection with Cross-Modal Alignment
- URL: http://arxiv.org/abs/2302.05200v1
- Date: Fri, 10 Feb 2023 12:06:18 GMT
- Title: End-to-end Semantic Object Detection with Cross-Modal Alignment
- Authors: Silvan Ferreira, Allan Martins, Ivanovitch Silva
- Abstract summary: Proposal-text alignment is performed using contrastive learning, producing a score for each proposal that reflects its semantic alignment with the text query.
The Region Proposal Network (RPN) is used to generate object proposals, and the end-to-end training process allows for an efficient and effective solution for semantic image search.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional semantic image search methods aim to retrieve images that match
the meaning of the text query. However, these methods typically search for objects over the whole image, without considering where objects are localized within the image. This paper presents an extension of existing object detection
models for semantic image search that considers the semantic alignment between
object proposals and text queries, with a focus on searching for objects within
images. The proposed model uses a single feature extractor (a pre-trained Convolutional Neural Network) for the image and a transformer encoder to encode the text
query. Proposal-text alignment is performed using contrastive learning,
producing a score for each proposal that reflects its semantic alignment with
the text query. The Region Proposal Network (RPN) is used to generate object
proposals, and the end-to-end training process allows for an efficient and
effective solution for semantic image search. The proposed model was trained
end-to-end, providing a promising solution for semantic image search that
retrieves images that match the meaning of the text query and generates
semantically relevant object proposals.
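As a rough illustration of the mechanism the abstract describes, the following PyTorch sketch projects RoI-pooled proposal features and a transformer text embedding into a shared space, scores each proposal by cosine similarity to the query, and trains the scores with an InfoNCE-style contrastive loss. It is a minimal sketch under assumed dimensions and module names, not the authors' released implementation.

```python
# Minimal sketch of the proposal-text alignment head described above.
# Assumptions (not the authors' code): module names, feature dimensions,
# and the exact pairing of the contrastive loss are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalTextAlignment(nn.Module):
    """Scores RPN proposals against a text-query embedding."""

    def __init__(self, proposal_dim=1024, text_dim=512, embed_dim=256):
        super().__init__()
        # Project pooled proposal features and text features into a shared space.
        self.proposal_proj = nn.Linear(proposal_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, as is common in contrastive objectives.
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, proposal_feats, text_feats):
        # proposal_feats: (num_proposals, proposal_dim) RoI-pooled RPN features
        # text_feats:     (num_queries, text_dim) transformer sentence embeddings
        p = F.normalize(self.proposal_proj(proposal_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Cosine-similarity alignment score for every (proposal, query) pair.
        return self.logit_scale.exp() * (p @ t.T)


def contrastive_loss(scores, positive_idx):
    # scores: (num_proposals, num_queries); positive_idx[i] indexes the proposal
    # treated as the match for query i. InfoNCE-style cross-entropy over proposals.
    return F.cross_entropy(scores.T, positive_idx)


if __name__ == "__main__":
    head = ProposalTextAlignment()
    proposals = torch.randn(100, 1024)    # stand-in for pooled RPN proposal features
    queries = torch.randn(4, 512)         # stand-in for text-query embeddings
    scores = head(proposals, queries)     # (100, 4) alignment scores
    loss = contrastive_loss(scores, torch.tensor([3, 17, 42, 99]))
    print(scores.shape, loss.item())
```

At retrieval time, an image could be ranked by the maximum score over its proposals, which also identifies the proposal most relevant to the query; this reading follows from the abstract rather than from released code.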
Related papers
- Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs [44.48400303207482]
The objective of zero-shot composed image retrieval (CIR) is to retrieve the target image using a query image and a query text.
Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text.
We propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs.
arXiv Detail & Related papers (2024-06-27T02:10:30Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Bridging the Gap between Local Semantic Concepts and Bag of Visual Words
for Natural Scene Image Retrieval [0.0]
A typical content-based image retrieval system deals with the query image and images in the dataset as a collection of low-level features.
Top-ranked images in the retrieved list, which have high similarity to the query image, may nevertheless differ from it in terms of the user's semantic interpretation.
This paper investigates how natural scene retrieval can be performed using the bag of visual word model and the distribution of local semantic concepts.
arXiv Detail & Related papers (2022-10-17T09:10:50Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and
Implicit Similarity [16.550790981646276]
Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z) - Text-based Person Search in Full Images via Semantic-Driven Proposal
Generation [42.25611020956918]
We propose a new end-to-end learning framework which jointly optimizes the pedestrian detection, identification and visual-semantic feature embedding tasks.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals.
arXiv Detail & Related papers (2021-09-27T11:42:40Z) - NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z) - Telling the What while Pointing the Where: Fine-grained Mouse Trace and
Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the content the user is looking for appears.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.