Whether you can locate or not? Interactive Referring Expression Generation
- URL: http://arxiv.org/abs/2308.09977v1
- Date: Sat, 19 Aug 2023 10:53:32 GMT
- Title: Whether you can locate or not? Interactive Referring Expression Generation
- Authors: Fulong Ye, Yuxing Long, Fangxiang Feng, Xiaojie Wang
- Abstract summary: We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
- Score: 12.148963878497243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Generation (REG) aims to generate unambiguous Referring
Expressions (REs) for objects in a visual scene, with a dual task of Referring
Expression Comprehension (REC) to locate the referred object. Existing methods
construct REG models independently by using only the REs as ground truth for
model training, without considering the potential interaction between REG and
REC models. In this paper, we propose an Interactive REG (IREG) model that can
interact with a real REC model, utilizing signals indicating whether the object
is located and the visual region located by the REC model to gradually modify
REs. Our experimental results on three RE benchmark datasets, RefCOCO,
RefCOCO+, and RefCOCOg, show that IREG outperforms previous state-of-the-art
methods on popular evaluation metrics. Furthermore, a human evaluation shows
that IREG generates better REs with the capability of interaction.
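The abstract describes a generate-locate-refine loop: the REG model proposes an RE, the REC model tries to locate it, and the located/not-located signal plus the returned region drive refinement. Below is a minimal Python sketch of that protocol, assuming hypothetical `reg_model.generate`/`reg_model.refine` and `rec_model.locate` interfaces; the paper's actual models, feedback signals, and stopping rule may differ.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a.x2 - a.x1) * (a.y2 - a.y1)
             + (b.x2 - b.x1) * (b.y2 - b.y1) - inter)
    return inter / union if union > 0 else 0.0

def interactive_reg(image, target: Box, reg_model, rec_model,
                    max_rounds: int = 3, iou_thresh: float = 0.5) -> str:
    """Generate an RE, let a REC model try to locate it, and refine the RE
    using the located/not-located signal and the region the REC returned."""
    expression = reg_model.generate(image, target)        # initial RE
    for _ in range(max_rounds):
        located = rec_model.locate(image, expression)     # REC feedback
        if iou(located, target) >= iou_thresh:            # "located" signal
            return expression                             # unambiguous enough
        # failure: feed the wrongly located region back for refinement
        expression = reg_model.refine(image, target, expression, located)
    return expression
```

The IoU test here merely stands in for whatever success criterion the REC model reports; only the loop structure is taken from the abstract.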
Related papers
- Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding [3.8673630752805446]
We propose an approach to referring expression generation (REG) that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate.
Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs.
arXiv Detail & Related papers (2024-09-09T15:33:07Z)
- Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression [44.36417883611282]
We introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals during reasoning.
We also introduce an expression-guided regression strategy (EGR) to refine location prediction.
Without any pretraining, the proposed graph-based method achieves better performance than state-of-the-art (SOTA) transformer-based methods.
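As a rough illustration of expression-guided gating over region proposals (a generic sketch only, not the paper's DGC module; all names and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class ExpressionGuidedGate(nn.Module):
    """Generic sketch: a sub-expression embedding produces a soft gate
    that down-weights (softly disables) irrelevant region proposals."""
    def __init__(self, text_dim: int, proposal_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(text_dim + proposal_dim, proposal_dim),
            nn.ReLU(),
            nn.Linear(proposal_dim, 1),
            nn.Sigmoid(),  # in (0, 1): values near 0 suppress a proposal
        )

    def forward(self, sub_expr: torch.Tensor, proposals: torch.Tensor):
        # sub_expr: (D_t,), proposals: (N, D_p)
        expanded = sub_expr.unsqueeze(0).expand(proposals.size(0), -1)
        g = self.gate(torch.cat([expanded, proposals], dim=-1))  # (N, 1)
        return proposals * g  # gated proposal features
```

A hard gate could threshold `g` to disable proposals outright; a soft gate keeps the module differentiable end to end.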
arXiv Detail & Related papers (2024-09-05T09:44:43Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled (reference image, text, target image) triplets.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models [48.56814147033251]
We introduce GenRES for a multi-dimensional assessment of GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness.
With GenRES, we empirically identified that precision/recall fails to justify the performance of GRE methods.
Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality.
arXiv Detail & Related papers (2024-02-16T15:01:24Z)
- Intrinsic Task-based Evaluation for Referring Expression Generation [9.322715583523928]
Referring Expressions (REs) generated by state-of-the-art neural models were indistinguishable not only from the REs in WebNLG but also from those generated by a simple rule-based system.
Here, we argue that this limitation could stem from the use of a purely ratings-based human evaluation.
We propose an intrinsic task-based evaluation for REG models, in which, in addition to rating the quality of REs, participants were asked to accomplish two meta-level tasks.
arXiv Detail & Related papers (2024-02-12T06:21:35Z)
- Continual Referring Expression Comprehension via Dual Modular Memorization [133.46886428655426]
Referring Expression Comprehension (REC) aims to localize an image region of a given object described by a natural-language expression.
Existing REC algorithms make a strong assumption that the training data fed into a model are given upfront, which degrades their practicality in real-world scenarios.
In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC in which a model learns from a stream of incoming tasks.
In order to continuously improve the model on sequential tasks without forgetting previously learned knowledge and without repeatedly retraining from scratch, we propose an effective baseline method named Dual Modular Memorization.
arXiv Detail & Related papers (2023-11-25T02:58:51Z)
- GREC: Generalized Referring Expression Comprehension [52.83101289813662]
This study introduces a new benchmark termed Generalized Referring Expression Comprehension (GREC).
This benchmark extends the classic REC by permitting expressions to describe any number of target objects.
To achieve this goal, we have built the first large-scale GREC dataset named gRefCOCO.
arXiv Detail & Related papers (2023-08-30T17:58:50Z)
- Towards Unifying Reference Expression Generation and Comprehension [22.72363956296498]
We propose a unified model for REG and REC, named UniRef.
It unifies the two tasks with a carefully designed Image-Region-Text Fusion layer (IRTF), which fuses image, region, and text via image cross-attention and region cross-attention.
We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train the UniRef model on multi-granular corpora.
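The IRTF description suggests stacked cross-attention over image and region features; below is a self-contained illustrative sketch of such a fusion block (not the paper's exact layer; the layout and residual placement are assumptions):

```python
import torch
import torch.nn as nn

class ImageRegionTextFusion(nn.Module):
    """Illustrative sketch: text tokens attend to image features and then
    to region features via two cross-attention blocks with residuals."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.image_xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.region_xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, image, region):
        # text: (B, T, D), image: (B, I, D), region: (B, R, D)
        x, _ = self.image_xattn(text, image, image)
        text = self.norm1(text + x)      # fuse global image context
        x, _ = self.region_xattn(text, region, region)
        return self.norm2(text + x)      # fuse region-level context
```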
arXiv Detail & Related papers (2022-10-24T09:53:41Z)
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [114.85628613911713]
Large-scale pre-trained models are useful for image classification across domains.
We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC.
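A minimal sketch of the zero-shot recipe this line of work builds on: crop each candidate region, score it against the expression with off-the-shelf CLIP, and return the best-scoring box. ReCLIP's full method additionally isolates proposals and resolves spatial relations, which this sketch omits; the checkpoint name is simply a common public one.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate(image: Image.Image, boxes, expression: str):
    """Score each candidate region crop against the expression and
    return the best-matching box (zero-shot, no ReC-specific training)."""
    crops = [image.crop(box) for box in boxes]  # boxes: [(x1, y1, x2, y2), ...]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (num_crops, 1)
    return boxes[logits.squeeze(-1).argmax().item()]
```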
arXiv Detail & Related papers (2022-04-12T17:55:38Z)
- Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation [135.67558811281984]
We propose a novel Multi-task Collaborative Network (MCN) to achieve joint learning of referring expression comprehension (REC) and referring expression segmentation (RES).
In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent.
We address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
arXiv Detail & Related papers (2020-03-19T14:25:18Z)