Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching
- URL: http://arxiv.org/abs/2002.08510v1
- Date: Thu, 20 Feb 2020 00:51:01 GMT
- Title: Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching
- Authors: Tianlang Chen, Jiebo Luo
- Abstract summary: Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
- Score: 102.62343739435289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing image-text matching approaches typically infer the similarity of an
image-text pair by capturing and aggregating the affinities between the text
and each independent object of the image. However, they ignore the connections
between the objects that are semantically related. These objects may
collectively determine whether the image corresponds to a text or not. To
address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN)
which processes images and sentences symmetrically by recurrent neural networks
(RNN). In particular, given an input image-text pair, our model reorders the
image objects based on the positions of their most related words in the text.
In the same way as extracting the hidden features from word embeddings, the
model leverages RNN to extract high-level object features from the reordered
object inputs. We validate that the high-level object features contain useful
joint information of semantically related objects, which benefit the retrieval
task. To compute the image-text similarity, we incorporate a Multi-attention
Cross Matching Model into DP-RNN. It aggregates the affinity between objects
and words with cross-modality guided attention and self-attention. Our model
achieves state-of-the-art performance on the Flickr30K dataset and competitive
performance on the MS-COCO dataset. Extensive experiments demonstrate the
effectiveness of our model.
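The central mechanism described above, reordering the image objects by the positions of their most related words and then extracting high-level object features with an RNN, can be pictured with a short sketch. The PyTorch fragment below is a minimal, hypothetical illustration only: the module name, the cosine-similarity object-word affinity, and the choice of a bidirectional GRU are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class RecurrentVisualEmbedding(nn.Module):
    """Sketch of DP-RNN's visual path: reorder region features by the sentence
    position of each region's most related word, then run an RNN over them."""

    def __init__(self, dim=1024):
        super().__init__()
        # A bidirectional GRU stands in for the "RNN" named in the abstract.
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, obj_feats, word_feats):
        # obj_feats:  (num_objects, dim) region features, e.g. from an object detector
        # word_feats: (num_words, dim)   word-level text features
        affinity = F.normalize(obj_feats, dim=-1) @ F.normalize(word_feats, dim=-1).t()
        # Position of the most related word for every object ...
        best_word_pos = affinity.argmax(dim=1)            # (num_objects,)
        # ... and reorder the objects to follow the word order of the sentence.
        reordered = obj_feats[best_word_pos.argsort()]    # (num_objects, dim)
        # Extract high-level object features the same way hidden states are
        # extracted from word embeddings.
        hidden, _ = self.rnn(reordered.unsqueeze(0))      # (1, num_objects, dim)
        return hidden.squeeze(0)

# Toy usage: 36 detected regions, a 12-word sentence, 1024-d features.
objs, words = torch.randn(36, 1024), torch.randn(12, 1024)
obj_hidden = RecurrentVisualEmbedding()(objs, words)      # (36, 1024)
```

The resulting object features would then be matched against the word features by the Multi-attention Cross Matching Model described in the abstract; that aggregation step is omitted from the sketch.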
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. (A toy sketch of such a caption scene graph follows this entry.)
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
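The caption representation mentioned in the entry above, a scene graph whose nodes are objects and attributes connected by relational edges, can be written down as a small data structure. The sketch below is purely illustrative: the class names and fields are invented here and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectNode:
    name: str                                             # e.g. "dog"
    attributes: List[str] = field(default_factory=list)   # e.g. ["brown"]

@dataclass
class RelationEdge:
    subject: int      # index into CaptionGraph.objects
    predicate: str    # e.g. "sitting on"
    obj: int          # index into CaptionGraph.objects

@dataclass
class CaptionGraph:
    objects: List[ObjectNode]
    relations: List[RelationEdge]

# "A brown dog sitting on a couch" as a toy caption graph.
graph = CaptionGraph(
    objects=[ObjectNode("dog", ["brown"]), ObjectNode("couch")],
    relations=[RelationEdge(subject=0, predicate="sitting on", obj=1)],
)
```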
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval [89.30660533051514]
Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa.
Image-text retrieval models commonly learn spurious correlations in the training data, such as frequent object co-occurrence.
We introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
arXiv Detail & Related papers (2023-04-06T21:45:46Z)
- Relationformer: A Unified Framework for Image-to-Graph Generation [18.832626244362075]
This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations.
We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly.
We achieve state-of-the-art performance on multiple, diverse and multi-domain datasets.
arXiv Detail & Related papers (2022-03-19T00:36:59Z)
- Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
- Fixed-size Objects Encoding for Visual Relationship Detection [16.339394922532282]
We propose a fixed-size object encoding method (FOE-VRD) to improve the performance of visual relationship detection tasks.
It uses one fixed-size vector to encode all objects in each input image to assist the process of relationship detection.
Experimental results on the VRD database show that the proposed method works well on both predicate classification and relationship detection. (A toy sketch of one such fixed-size encoding follows this entry.)
arXiv Detail & Related papers (2020-05-29T14:36:25Z)
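One plausible reading of the fixed-size encoding above is a vector whose length does not depend on how many objects were detected, for example a bag-of-classes histogram concatenated with pooled appearance features. The sketch below is an assumption-laden illustration, not the FOE-VRD encoding itself; the vocabulary size, feature dimension, and pooling choice are all invented.

```python
import numpy as np

NUM_CLASSES = 100   # assumed size of the object-category vocabulary
FEAT_DIM = 512      # assumed dimensionality of per-object appearance features

def encode_objects_fixed_size(class_ids, obj_feats):
    """Encode a variable number of detected objects as one fixed-size vector:
    a class-count histogram concatenated with mean-pooled appearance features."""
    class_ids = np.asarray(class_ids)                  # (num_objects,)
    obj_feats = np.asarray(obj_feats, dtype=float)     # (num_objects, FEAT_DIM)
    histogram = np.bincount(class_ids, minlength=NUM_CLASSES).astype(float)
    pooled = obj_feats.mean(axis=0) if len(obj_feats) else np.zeros(FEAT_DIM)
    return np.concatenate([histogram, pooled])         # (NUM_CLASSES + FEAT_DIM,)

# Three detected objects map to one vector of length NUM_CLASSES + FEAT_DIM.
vec = encode_objects_fixed_size([3, 3, 17], np.random.rand(3, FEAT_DIM))
assert vec.shape == (NUM_CLASSES + FEAT_DIM,)
```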