RTIC: Residual Learning for Text and Image Composition using Graph
Convolutional Network
- URL: http://arxiv.org/abs/2104.03015v2
- Date: Thu, 8 Apr 2021 23:28:15 GMT
- Title: RTIC: Residual Learning for Text and Image Composition using Graph
Convolutional Network
- Authors: Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu
- Abstract summary: We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
- Score: 19.017377597937617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the compositional learning of images and texts for
image retrieval. The query is given in the form of an image and text that
describes the desired modifications to the image; the goal is to retrieve the
target image that satisfies the given modifications and resembles the query by
composing information in both the text and image modalities. To accomplish this
task, we propose a simple new architecture using skip connections that can
effectively encode the errors between the source and target images in the
latent space. Furthermore, we introduce a novel method that combines the graph
convolutional network (GCN) with existing composition methods. We find that the
combination consistently improves the performance in a plug-and-play manner. We
perform thorough and exhaustive experiments on several widely used datasets,
and achieve state-of-the-art scores on the task with our model. To ensure
fairness in comparison, we suggest a strict standard for the evaluation because
a small difference in the training conditions can significantly affect the
final performance. We release our implementation, including that of all the
compared methods, for reproducibility.
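For intuition, here is a minimal PyTorch sketch of the two ideas in the abstract: a composer whose skip connection lets the network encode only the residual (the "error") between source and target image features, and a small graph-convolution step applied on top as a plug-and-play refinement. This is an illustrative sketch, not the authors' released code; the module names (`ResidualComposer`, `GCNRefiner`), feature dimensions, and graph construction are all assumptions.

```python
import torch
import torch.nn as nn


class ResidualComposer(nn.Module):
    """Compose an image embedding with a text embedding.

    The text conditions a residual that is added back onto the image
    feature via a skip connection, so the network only has to model the
    difference between source and target images, not the target itself.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Skip connection: output = image feature + text-conditioned residual.
        return img + self.residual(torch.cat([img, txt], dim=-1))


class GCNRefiner(nn.Module):
    """One graph-convolution step over a batch of composed features.

    `adj` is a normalized adjacency matrix relating the samples; how the
    graph is built is a design choice of the underlying method (assumed here).
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Aggregate neighbor features, project, and keep a skip connection,
        # so the GCN acts as a plug-and-play refinement of any composer.
        return x + torch.relu(self.proj(adj @ x))


if __name__ == "__main__":
    composer, refiner = ResidualComposer(), GCNRefiner()
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    adj = torch.eye(8)  # trivial graph, purely for a shape check
    out = refiner(composer(img, txt), adj)
    print(out.shape)  # torch.Size([8, 512])
```

Because both modules keep a skip connection, the refiner can be dropped behind any existing composition method without disturbing its output when the residual is small, which matches the plug-and-play behaviour the abstract describes.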
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information in unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score, and qualitative results when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, based on image classification tasks using the labels of the Places dataset, first consider only the visual part.
We find that also considering the texts associated with the images can improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- Image Retrieval for Structure-from-Motion via Graph Convolutional Network [13.040952255039702]
We present a novel retrieval method based on Graph Convolutional Network (GCN) to generate accurate pairwise matches without costly redundancy.
By constructing a subgraph surrounding the query image as input data, we adopt a learnable GCN to determine whether nodes in the subgraph have overlapping regions with the query image.
Experiments demonstrate that our method performs remarkably well on a challenging dataset of highly ambiguous and duplicated scenes (a minimal sketch of this subgraph-and-GCN idea follows this list).
arXiv Detail & Related papers (2020-09-17T04:03:51Z)
- Graph Edit Distance Reward: Learning to Edit Scene Graph [69.39048809061714]
We propose a new method to edit a scene graph according to user instructions, a task that has not been explored before.
Specifically, to learn to edit scene graphs according to the semantics given by texts, we propose a Graph Edit Distance Reward.
In the context of text-editing image retrieval, we validate the effectiveness of our method on the CSS and CRIR datasets.
arXiv Detail & Related papers (2020-08-15T04:52:16Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
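As a companion to the Graph Convolutional Network entry above on Structure-from-Motion retrieval, the following is a minimal sketch of the subgraph-scoring idea: given precomputed descriptors for the images in a subgraph around a query, a small GCN predicts a per-node overlap score. This is not code from the cited paper; the class name, layer sizes, and the identity adjacency are placeholder assumptions.

```python
import torch
import torch.nn as nn


class OverlapGCN(nn.Module):
    """Two graph-convolution layers followed by a per-node overlap score."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gc1 = nn.Linear(dim, dim)
        self.gc2 = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Propagate descriptors over the subgraph twice, then score each node;
        # a high score means the node likely overlaps with the query image.
        h = torch.relu(self.gc1(adj @ feats))
        h = torch.relu(self.gc2(adj @ h))
        return torch.sigmoid(self.head(h)).squeeze(-1)


if __name__ == "__main__":
    n = 16                       # nodes in the subgraph around the query
    feats = torch.randn(n, 256)  # per-image descriptors (assumed precomputed)
    adj = torch.eye(n)           # placeholder normalized adjacency
    print(OverlapGCN()(feats, adj).shape)  # torch.Size([16])
```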