A Unified Mutual Supervision Framework for Referring Expression
Segmentation and Generation
- URL: http://arxiv.org/abs/2211.07919v1
- Date: Tue, 15 Nov 2022 06:08:39 GMT
- Title: A Unified Mutual Supervision Framework for Referring Expression
Segmentation and Generation
- Authors: Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Liwei Wang
- Abstract summary: Referring Expression Segmentation (RES) and Referring Expression Generation (REG) are mutually inverse tasks that can be naturally jointly trained.
We propose a unified mutual supervision framework that enables two tasks to improve each other.
- Score: 21.27400500728834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring Expression Segmentation (RES) and Referring Expression
Generation (REG) are mutually inverse tasks that can be naturally jointly trained. Though
recent work has explored such joint training, the mechanism of how RES and REG
can benefit each other is still unclear. In this paper, we propose a unified
mutual supervision framework that enables two tasks to improve each other. Our
mutual supervision contains two directions. On the one hand, Disambiguation
Supervision leverages the expression unambiguity measurement provided by RES to
enhance the language generation of REG. On the other hand, Generation
Supervision uses expressions automatically generated by REG to scale up the
training of RES. Such unified mutual supervision effectively improves two tasks
by solving their bottleneck problems. Extensive experiments show that our
approach significantly outperforms all existing methods on REG and RES tasks
under the same setting, and detailed ablation studies demonstrate the
effectiveness of all components in our framework.
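The two supervision directions described in the abstract can be sketched as a small training utility. This is a minimal illustrative sketch, not the paper's implementation: all function names, the toy models, and the confidence-as-unambiguity scoring are assumptions made for exposition.

```python
# Hypothetical sketch of the two supervision directions from the abstract.
# Names (res_unambiguity, generation_supervision, toy models) are illustrative.

def res_unambiguity(expression, image, res_model):
    """Disambiguation Supervision: use the RES model's confidence as an
    unambiguity measurement for an expression, which could then weight
    the REG language-generation loss."""
    mask, confidence = res_model(image, expression)
    return confidence  # higher = the expression picks out the referent less ambiguously

def generation_supervision(images, regions, reg_model):
    """Generation Supervision: let REG synthesize expressions for
    unannotated regions, producing pseudo-labels that scale up RES training."""
    return [(img, reg, reg_model(img, reg)) for img, reg in zip(images, regions)]

# Toy stand-ins so the sketch runs end to end.
def toy_res_model(image, expression):
    return "mask", 0.9 if "red" in expression else 0.4

def toy_reg_model(image, region):
    return f"the red object at {region}"

score = res_unambiguity("the red cup", "img0", toy_res_model)
pseudo = generation_supervision(["img0"], ["(10, 20)"], toy_reg_model)
```

In the full framework each direction addresses the other task's bottleneck: RES scoring discourages ambiguous REG outputs, while REG's generated expressions relieve RES's annotation scarcity.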
Related papers
- CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards [53.36917093757101]
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). We introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment.
arXiv Detail & Related papers (2025-07-23T02:26:33Z) - WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation [11.906318282459942]
We propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM).
arXiv Detail & Related papers (2025-05-24T13:05:17Z) - Subtask-Aware Visual Reward Learning from Segmented Demonstrations [97.80917991633248]
This paper introduces REDS: REward learning from Demonstration with Segmentations, a novel reward learning framework.
We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals.
Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World.
arXiv Detail & Related papers (2025-02-28T01:25:37Z) - Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning [51.54046200512198]
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models.
A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation.
To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent.
arXiv Detail & Related papers (2025-01-25T14:24:50Z) - Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension [46.07415235144545]
We address the challenging task of Generalized Referring Expression Comprehension (GREC).
Existing REC methods face challenges in handling the complex cases encountered in GREC.
We propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G).
arXiv Detail & Related papers (2025-01-02T18:57:59Z) - Multi-branch Collaborative Learning Network for 3D Visual Grounding [66.67647903507927]
3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration.
We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capacity to learn specific information for each task.
arXiv Detail & Related papers (2024-07-07T13:27:14Z) - RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner [16.280644319404946]
Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions.
This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
arXiv Detail & Related papers (2024-02-08T11:40:50Z) - Whether you can locate or not? Interactive Referring Expression
Generation [12.148963878497243]
We propose an Interactive REG (IREG) model that can interact with a real REC model.
IREG outperforms previous state-of-the-art methods on popular evaluation metrics.
arXiv Detail & Related papers (2023-08-19T10:53:32Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Towards Unifying Reference Expression Generation and Comprehension [22.72363956296498]
We propose a unified model for REG and REC, named UniRef.
It unifies these two tasks with the carefully designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via image cross-attention and region cross-attention.
We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train UniRef model on multi-granular corpora.
arXiv Detail & Related papers (2022-10-24T09:53:41Z) - Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised
Referring Expression Grounding [214.8003571700285]
Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression.
We design an entity-enhanced adaptive reconstruction network (EARN)
EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction.
arXiv Detail & Related papers (2022-07-18T05:30:45Z) - Weakly Supervised Disentangled Representation for Goal-conditioned
Reinforcement Learning [15.698612710580447]
We propose a skill learning framework DR-GRL that aims to improve the sample efficiency and policy generalization.
In a weakly supervised manner, we propose a Spatial Transform AutoEncoder (STAE) to learn an interpretable and controllable representation.
We empirically demonstrate that DR-GRL significantly outperforms the previous methods in sample efficiency and policy generalization.
arXiv Detail & Related papers (2022-02-28T09:05:14Z) - On Exploring Pose Estimation as an Auxiliary Learning Task for
Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework.
By jointly training these two tasks in a mutually beneficial manner, our model learns higher quality modality-shared and ID-related features.
Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-01-11T09:44:00Z) - Return-Based Contrastive Representation Learning for Reinforcement
Learning [126.7440353288838]
We propose a novel auxiliary task that forces the learnt representations to discriminate state-action pairs with different returns.
Our algorithm outperforms strong baselines on complex tasks in Atari games and DeepMind Control suite.
arXiv Detail & Related papers (2021-02-22T13:04:18Z) - Multi-task Collaborative Network for Joint Referring Expression
Comprehension and Segmentation [135.67558811281984]
We propose a novel Multi-task Collaborative Network (MCN) to achieve joint learning of referring expression comprehension (REC) and segmentation (RES).
In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent.
We address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
arXiv Detail & Related papers (2020-03-19T14:25:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.