Towards Unifying Reference Expression Generation and Comprehension
- URL: http://arxiv.org/abs/2210.13076v1
- Date: Mon, 24 Oct 2022 09:53:41 GMT
- Title: Towards Unifying Reference Expression Generation and Comprehension
- Authors: Duo Zheng, Tao Kong, Ya Jing, Jiaan Wang, Xiaojie Wang
- Abstract summary: We propose a unified model for REG and REC, named UniRef.
It unifies these two tasks with a carefully designed Image-Region-Text Fusion layer (IRTF), which fuses image, region and text via image cross-attention and region cross-attention.
We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train the UniRef model on multi-granular corpora.
- Score: 22.72363956296498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reference Expression Generation (REG) and Comprehension (REC) are two highly
correlated tasks. Modeling REG and REC simultaneously, so as to exploit the
relation between them, is a promising way to improve both. However, the two
tasks take distinct inputs, and connecting them within a single model is
difficult, which poses challenges to the design and training of a joint model.
To address these problems, we propose a unified model for REG and REC, named
UniRef. It unifies the two tasks with a carefully designed Image-Region-Text
Fusion layer (IRTF), which fuses image, region and text via image
cross-attention and region cross-attention. Additionally, IRTF can generate
pseudo input regions for the REC task, so that both tasks share an identical
representation space in a uniform way. We further propose Vision-conditioned
Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP)
to pre-train the UniRef model on multi-granular corpora. VMLM and TRP are
directly related to REG and REC, respectively, yet can benefit each other. We
conduct extensive experiments on three benchmark datasets: RefCOCO, RefCOCO+
and RefCOCOg. Experimental results show that our model outperforms previous
state-of-the-art methods on both REG and REC.
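The abstract describes IRTF only at a high level. As a rough, non-authoritative sketch of the fusion pattern it names (text attending to image features and to region features inside one transformer layer, with a learnable query producing a pseudo region for REC), here is a minimal PyTorch illustration; the sub-layer ordering, dimensions and the pseudo-region mechanism are assumptions, not the authors' implementation.

```python
# Minimal sketch of an Image-Region-Text Fusion (IRTF) layer, assuming a
# standard post-norm transformer layout. All names, dimensions and the
# pseudo-region mechanism are guesses based on the abstract, not the paper.
import torch
import torch.nn as nn

class IRTFLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.region_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        # Learnable query that synthesizes a pseudo input region for REC,
        # where no ground-truth region is available (hypothetical mechanism).
        self.pseudo_region_query = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, text, image, region=None):
        # text: (B, T, d); image: (B, I, d); region: (B, R, d) or None for REC.
        if region is None:
            q = self.pseudo_region_query.expand(image.size(0), -1, -1)
            region, _ = self.image_xattn(q, image, image)  # pseudo input region
        x = self.norms[0](text + self.self_attn(text, text, text)[0])
        x = self.norms[1](x + self.image_xattn(x, image, image)[0])     # image cross-attention
        x = self.norms[2](x + self.region_xattn(x, region, region)[0])  # region cross-attention
        return self.norms[3](x + self.ffn(x))
```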
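Likewise, VMLM is described only as masked language modeling conditioned on vision. Below is a minimal sketch of what such an objective could look like, reusing the layer above and assuming standard BERT-style masking; the masking ratio and the linear vocabulary head are illustrative choices, and TRP would analogously supervise a region-prediction head from text.

```python
# Hypothetical Vision-conditioned Masked Language Modeling (VMLM) step:
# BERT-style token corruption, with predictions conditioned on fused
# image/region features. Ratio and head are illustrative assumptions.
import torch
import torch.nn.functional as F

def vmlm_loss(fusion_layer, vocab_head, token_embed, tokens, image, region,
              mask_id, mask_ratio=0.15):
    # tokens: (B, T) ids; token_embed: embedding module; vocab_head: d -> vocab.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, mask_id)                 # corrupt inputs
    fused = fusion_layer(token_embed(corrupted), image, region)   # vision-conditioned
    logits = vocab_head(fused)                                    # (B, T, vocab)
    labels = tokens.masked_fill(~mask, -100)                      # score only masked slots
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```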
Related papers
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer.
To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM).
Within MRefM, we propose a referring-aware dynamic image masking strategy that adapts the masking to the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
- Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding [3.8673630752805446]
We propose an approach to referring expression generation (REG) that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate.
Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs.
arXiv Detail & Related papers (2024-09-09T15:33:07Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of reference image, text and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Continual Referring Expression Comprehension via Dual Modular Memorization [133.46886428655426]
Referring Expression Comprehension (REC) aims to localize the image region of a given object described by a natural-language expression.
Existing REC algorithms make the strong assumption that all training data are given upfront, which limits their practicality in real-world scenarios.
In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC in which a model learns on a stream of incoming tasks.
In order to continuously improve the model on sequential tasks without forgetting prior knowledge and without repeatedly re-training from scratch, we propose an effective baseline method named Dual Modular Memorization.
arXiv Detail & Related papers (2023-11-25T02:58:51Z)
- Whether you can locate or not? Interactive Referring Expression Generation [12.148963878497243]
We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
arXiv Detail & Related papers (2023-08-19T10:53:32Z)
- How Fragile is Relation Extraction under Entity Replacements? [70.34001923252711]
Relation extraction (RE) aims to extract the relations between entity names from the textual context.
Existing work has found that RE models rely on entity name patterns to make predictions while ignoring the textual context.
This motivates us to raise the question: are RE models robust to entity replacements?
arXiv Detail & Related papers (2023-05-22T23:53:32Z)
- Automatically Generating Counterfactuals for Relation Extraction [18.740447044960796]
Relation extraction (RE) is a fundamental task in natural language processing.
Current deep neural models have achieved high accuracy but are easily affected by spurious correlations.
We develop a novel approach to derive contextual counterfactuals for entities.
arXiv Detail & Related papers (2022-02-22T04:46:10Z)
- Robust Reference-based Super-Resolution via C2-Matching [77.51610726936657]
Reference-based Super-Resolution (Ref-SR) has recently emerged as a promising paradigm to enhance a low-resolution (LR) input image by introducing an additional high-resolution (HR) reference image.
Existing Ref-SR methods mostly rely on implicit correspondence matching to borrow HR textures from reference images to compensate for the information loss in input images.
We propose C2-Matching, which produces explicit robust matching across transformation and resolution.
arXiv Detail & Related papers (2021-06-03T16:40:36Z)
- Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation [135.67558811281984]
We propose a novel Multi-task Collaborative Network (MCN) to achieve joint learning of referring expression comprehension (REC) and referring expression segmentation (RES); see the sketch after this list.
In MCN, RES can help REC achieve better language-vision alignment, while REC can help RES better locate the referent.
We address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
arXiv Detail & Related papers (2020-03-19T14:25:18Z)
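As referenced in the MCN entry above, here is a generic illustration of the joint REC/RES pattern that entry describes: two task heads over one shared fused representation, with a crude box-mask agreement term standing in for the idea behind CEM. This is a sketch under assumptions, not MCN's actual architecture; all names and shapes are hypothetical.

```python
# Generic joint referring comprehension (box head) and segmentation (mask
# head) over a shared fused feature -- an illustration of the multi-task
# pattern, not MCN's design.
import torch
import torch.nn as nn

class JointRefHeads(nn.Module):
    def __init__(self, d_model=256, mask_size=28):
        super().__init__()
        self.box_head = nn.Linear(d_model, 4)                # (cx, cy, w, h) in [0, 1]
        self.mask_head = nn.Linear(d_model, mask_size * mask_size)
        self.mask_size = mask_size

    def forward(self, fused):                                # fused: (B, d)
        box = self.box_head(fused).sigmoid()                 # REC output
        mask = self.mask_head(fused).view(-1, self.mask_size, self.mask_size)
        return box, mask.sigmoid()                           # RES output

def box_mask_agreement(box, mask):
    # Penalize mask mass falling outside the predicted box: a crude stand-in
    # for the consistency idea behind Consistency Energy Maximization (CEM).
    B, H, W = mask.shape
    ys = torch.linspace(0, 1, H, device=mask.device).view(1, H, 1)
    xs = torch.linspace(0, 1, W, device=mask.device).view(1, 1, W)
    cx, cy, w, h = (box[:, i].view(-1, 1, 1) for i in range(4))
    inside = ((xs - cx).abs() <= w / 2) & ((ys - cy).abs() <= h / 2)
    outside = (mask * (~inside).float()).sum((1, 2))
    return (outside / mask.sum((1, 2)).clamp(min=1e-6)).mean()
```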