Towards Omni-supervised Referring Expression Segmentation
- URL: http://arxiv.org/abs/2311.00397v2
- Date: Mon, 27 Nov 2023 09:02:06 GMT
- Title: Towards Omni-supervised Referring Expression Segmentation
- Authors: Minglang Huang, Yiyi Zhou, Gen Luo, Guannan Jiang, Weilin Zhuang,
Xiaoshuai Sun
- Abstract summary: Referring Expression Segmentation (RES) is an emerging task in computer vision, which segments the target instances in images based on text descriptions.
We propose a new learning task for RES called Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to make full use of unlabeled, fully labeled, and weakly labeled data.
- Score: 36.0543534772681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES) is an emerging task in computer
vision, which segments the target instances in images based on text
descriptions. However, its development is plagued by the expensive segmentation
labels. To address this issue, we propose a new learning task for RES called
Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to
make full use of unlabeled, fully labeled and weakly labeled data, e.g.,
referring points or grounding boxes, for efficient RES training. To accomplish
this task, we also propose a novel yet strong baseline method for Omni-RES
based on the recently popular teacher-student learning, where the weak labels
are not directly transformed into supervision signals but used as a yardstick
to select and refine high-quality pseudo-masks for teacher-student learning. To
validate the proposed Omni-RES method, we apply it to a set of state-of-the-art
RES models and conduct extensive experiments on several RES datasets. The
experimental results demonstrate the clear merits of Omni-RES over the
fully-supervised and semi-supervised training schemes. For instance, with only
10% fully labeled data, Omni-RES can help the base model achieve 100% fully
supervised performance, and it also outperforms the semi-supervised
alternative by a large margin, e.g., +14.93% on RefCOCO and +14.95% on
RefCOCO+. More importantly, Omni-RES also enables the use of large-scale
vision-language data like Visual Genome to facilitate low-cost RES training,
achieving new SOTA performance on RES, e.g., 80.66 on RefCOCO.
Related papers
- ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle semi-supervised visual grounding by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text.
We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z) - Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment [126.34547428473968]
Large language models (LLMs) still struggle to align with human preferences in complex tasks and scenarios.
We propose a low-redundant alignment method named ALLO, which focuses on optimizing the most relevant neurons with the most useful supervision signals.
Experimental results on 10 datasets have shown the effectiveness of ALLO.
arXiv Detail & Related papers (2024-06-18T13:34:40Z) - SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation [66.92696817276288]
SemiRES is a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform RES.
SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarcation.
In instances where a precise mask cannot be matched from the available candidates, we develop the Pixel-Wise Adjustment (PWA) strategy.
arXiv Detail & Related papers (2024-06-03T15:42:30Z) - RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner [16.280644319404946]
Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions.
This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
arXiv Detail & Related papers (2024-02-08T11:40:50Z) - Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation [38.0788558329856]
We build the largest visual grounding dataset, namely MRES-32M, which comprises over 32.2M high-quality masks and captions.
In addition, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task.
arXiv Detail & Related papers (2023-12-13T09:29:45Z) - GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
arXiv Detail & Related papers (2023-06-01T17:57:32Z) - OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions [94.31804364707575]
We propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution.
We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations.
Experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection.
arXiv Detail & Related papers (2022-10-11T15:51:31Z) - Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation [135.67558811281984]
We propose a novel Multi-task Collaborative Network (MCN) to achieve joint learning of referring expression comprehension (REC) and segmentation (RES).
In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent.
We address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
arXiv Detail & Related papers (2020-03-19T14:25:18Z)