Towards Omni-supervised Referring Expression Segmentation
- URL: http://arxiv.org/abs/2311.00397v2
- Date: Mon, 27 Nov 2023 09:02:06 GMT
- Title: Towards Omni-supervised Referring Expression Segmentation
- Authors: Minglang Huang, Yiyi Zhou, Gen Luo, Guannan Jiang, Weilin Zhuang,
Xiaoshuai Sun
- Abstract summary: Referring Expression Segmentation (RES) is an emerging task in computer vision, which segments the target instances in images based on text descriptions.
We propose a new learning task for RES called Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to make full use of unlabeled, fully labeled and weakly labeled data.
- Score: 36.0543534772681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES) is an emerging task in
computer vision, which segments the target instances in images based on text
descriptions. However, its development is hindered by expensive segmentation
labels. To address this issue, we propose a new learning task for RES called
Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to
make full use of unlabeled, fully labeled and weakly labeled data, e.g.,
referring points or grounding boxes, for efficient RES training. To accomplish
this task, we also propose a novel yet strong baseline method for Omni-RES
based on the recently popular teacher-student learning paradigm, where the
weak labels are not directly transformed into supervision signals but used as
a yardstick to select and refine high-quality pseudo-masks for teacher-student
learning. To validate the proposed Omni-RES method, we apply it to a set of
state-of-the-art RES models and conduct extensive experiments on several RES
datasets. The experimental results demonstrate the clear merits of Omni-RES
over fully supervised and semi-supervised training schemes. For instance, with
only 10% fully labeled data, Omni-RES can help the base model achieve 100%
fully supervised performance, and it also outperforms the semi-supervised
alternative by a large margin, e.g., +14.93% on RefCOCO and +14.95% on
RefCOCO+. More importantly, Omni-RES also enables the use of large-scale
vision-language data like Visual Genome to facilitate low-cost RES training,
achieving new SOTA performance on RES, e.g., 80.66 on RefCOCO.
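The abstract's core mechanism, using weak labels as a yardstick to filter teacher pseudo-masks rather than converting them directly into supervision, can be illustrated with a minimal sketch. The sketch below assumes the weak labels are grounding boxes and scores each teacher pseudo-mask by the IoU between the mask's tight bounding box and the weak box; the function names, the scoring rule, and the 0.5 threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: use weak grounding boxes as a yardstick to select
# high-quality teacher pseudo-masks for teacher-student training.
# The IoU-based rule and threshold below are assumptions for
# illustration, not the paper's exact selection criterion.
import numpy as np

def mask_to_box(mask: np.ndarray) -> tuple:
    """Tight bounding box (x0, y0, x1, y1) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

def box_iou(a: tuple, b: tuple) -> float:
    """IoU between two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_pseudo_masks(pseudo_masks, weak_boxes, iou_thresh=0.5):
    """Keep teacher pseudo-masks that agree with their weak box label.

    pseudo_masks: list of HxW binary arrays predicted by the teacher.
    weak_boxes:   list of (x0, y0, x1, y1) weak grounding-box labels.
    Returns the masks whose tight box matches the weak box well enough
    to serve as pseudo supervision for the student.
    """
    kept = []
    for mask, box in zip(pseudo_masks, weak_boxes):
        if mask.any() and box_iou(mask_to_box(mask), box) >= iou_thresh:
            kept.append(mask)
    return kept
```

A referring-point weak label would admit the same structure, e.g., keeping only the masks that contain the annotated point; the selected pseudo-masks would then supervise the student branch.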