Differentiated Relevances Embedding for Group-based Referring Expression
Comprehension
- URL: http://arxiv.org/abs/2203.06382v2
- Date: Fri, 2 Jun 2023 03:39:21 GMT
- Title: Differentiated Relevances Embedding for Group-based Referring Expression
Comprehension
- Authors: Fuhai Chen, Xuri Ge, Xiaoshuai Sun, Yue Gao, Jianzhuang Liu, Fufeng
Chen, Wenjie Li
- Abstract summary: The key to referring expression comprehension lies in capturing the cross-modal visual-linguistic relevance.
We propose a multi-group self-paced relevance learning schema that adaptively assigns within-group object-expression pairs different priorities.
Experiments on three standard REC benchmarks demonstrate the effectiveness and superiority of our method.
- Score: 57.52186959089885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The key to referring expression comprehension lies in capturing the
cross-modal visual-linguistic relevance. Existing works typically model this
relevance within each image, where the anchor object/expression and its
positive expression/object share the same attribute as the negative
expression/object but differ in attribute value. These objects/expressions are
used exclusively to learn an implicit representation of the attribute from a
single pair of differing values, which limits the accuracy of the attribute
representations, the expression/object representations, and their cross-modal
relevances, since each anchor object/expression usually has multiple attributes
and each attribute usually has multiple potential values. To this end, we
investigate a novel REC problem named Group-based REC, where each
object/expression is simultaneously employed to construct multiple triplets
across semantically similar images. To tackle the explosion of negatives and
the differentiation of anchor-negative relevance scores, we propose a
multi-group self-paced relevance learning schema that adaptively assigns
within-group object-expression pairs different priorities based on their
cross-modal relevances. Since the average cross-modal relevance varies
considerably across groups, we further design an across-group relevance
constraint to balance the bias of the group priorities. Experiments on three
standard REC benchmarks demonstrate the effectiveness and superiority of our
method.
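
To make the learning schema concrete, below is a minimal, hypothetical PyTorch sketch of the two ideas in the abstract: soft self-paced weights that prioritize within-group object-expression pairs by their current triplet loss, and an across-group term that penalizes groups whose mean relevance drifts from the global mean. All names (group_triplet_loss, multi_group_loss, lambda_pace) and the specific weighting and balancing forms are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: the weighting and balancing forms below are
# illustrative stand-ins for the paper's self-paced scheme, not its code.
import torch

def group_triplet_loss(anchor, positive, negatives, margin=0.2, lambda_pace=1.0):
    """Self-paced triplet loss within one group of semantically similar images.

    anchor:    (d,)   embedding of the anchor expression/object
    positive:  (d,)   embedding of its matching object/expression
    negatives: (n, d) embeddings of the within-group negatives
    """
    pos_rel = torch.cosine_similarity(anchor, positive, dim=0)           # scalar
    neg_rel = torch.cosine_similarity(anchor.unsqueeze(0), negatives)    # (n,)
    # One triplet per negative; harder negatives produce larger losses.
    losses = torch.clamp(margin - pos_rel + neg_rel, min=0.0)            # (n,)
    # Soft self-paced weights: low-loss (reliable) pairs get higher priority.
    weights = torch.clamp(1.0 - losses.detach() / lambda_pace, min=0.0)  # (n,)
    return (weights * losses).sum(), neg_rel.mean()

def multi_group_loss(groups, margin=0.2, lambda_pace=1.0):
    """Aggregate group losses plus a rough across-group balancing term.

    groups: list of (anchor, positive, negatives) tuples, one per group.
    The balancing term penalizes deviation of each group's mean relevance
    from the global mean, so no group dominates merely because its average
    cross-modal relevance happens to be higher.
    """
    losses, means = zip(*(group_triplet_loss(a, p, n, margin, lambda_pace)
                          for a, p, n in groups))
    means = torch.stack(means)
    balance = ((means - means.mean()) ** 2).mean()
    return torch.stack(losses).mean() + balance

# Toy usage with random embeddings: 3 groups, 5 within-group negatives each.
torch.manual_seed(0)
groups = [(torch.randn(128), torch.randn(128), torch.randn(5, 128))
          for _ in range(3)]
print(multi_group_loss(groups))
```

Lowering lambda_pace makes the weighting stricter (only low-loss pairs contribute), which matches the usual self-paced curriculum: raise lambda_pace over training so that harder within-group negatives are phased in gradually.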
Related papers
- Enhancing Neural Subset Selection: Integrating Background Information into Set Representations [53.15923939406772]
We show that when the target value is conditioned on both the input set and subset, it is essential to incorporate an invariant sufficient statistic of the superset into the subset of interest.
This ensures that the output value remains invariant to permutations of the subset and its corresponding superset, enabling identification of the specific superset from which the subset originated.
arXiv Detail & Related papers (2024-02-05T16:09:35Z) - Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association [23.06058982328083]
We focus on the wide existence of reporting bias in visual-language datasets.
We propose a bimodal augmentation (BiAug) approach to mitigate this bias.
BiAug synthesizes visual-language examples with a rich array of object-attribute pairings and constructs cross-modal hard negatives.
arXiv Detail & Related papers (2023-10-02T16:48:50Z) - Co-Salient Object Detection with Semantic-Level Consensus Extraction and
Dispersion [27.120768849942145]
Co-salient object detection aims to highlight the common salient object in each image.
We propose a hierarchical Transformer module for extracting semantic-level consensus.
A Transformer-based dispersion module takes into account the variation of the co-salient object in different scenes.
arXiv Detail & Related papers (2023-09-14T14:39:07Z) - Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness.
arXiv Detail & Related papers (2023-06-03T11:50:44Z) - Improving Object Detection and Attribute Recognition by Feature
Entanglement Reduction [26.20319853343761]
We show that object detection should be attribute-independent and that attributes should be largely object-independent.
We disentangle them with a two-stream model in which the category and attribute features are computed independently but the classification heads share Regions of Interest (RoIs).
Compared with a traditional single-stream model, our model shows significant improvements over VG-20, a subset of Visual Genome, on both supervised and attribute transfer tasks.
arXiv Detail & Related papers (2021-08-25T22:27:06Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - Attention Guided Semantic Relationship Parsing for Visual Question
Answering [36.84737596725629]
Humans explain inter-object relationships with semantic labels that demonstrate the high-level understanding required to perform vision-language tasks such as Visual Question Answering (VQA).
Existing VQA models represent relationships as combinations of object-level visual features, which constrains a model to expressing interactions between objects in a single domain while it is trying to solve a multi-modal task.
In this paper, we propose a general-purpose semantic relationship parser which generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention mechanism that learns to identify relationship triplets that are important to answering the given question.
arXiv Detail & Related papers (2020-10-05T00:23:49Z) - Understanding Adversarial Examples from the Mutual Influence of Images
and Perturbations [83.60161052867534]
We analyze adversarial examples by disentangling the clean images and adversarial perturbations, and analyze their influence on each other.
Our results suggest a new perspective towards the relationship between images and universal perturbations.
We are the first to achieve the challenging task of a targeted universal attack without utilizing original training data.
arXiv Detail & Related papers (2020-07-13T05:00:09Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.