Related papers: RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

URL: http://arxiv.org/abs/2402.05589v2
Date: Sun, 11 Feb 2024 10:27:04 GMT
Title: RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner
Authors: Ying Zang, Chenglong Fu, Runlong Cao, Didi Zhu, Min Zhang, Wenjun Hu, Lanyun Zhu, Tianrun Chen
Abstract summary: Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
Score: 16.280644319404946
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state-of-the-art. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. To address challenges that include the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces a trifecta of adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.
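
The abstract mentions pseudo-label quality and strong-weak supervision without giving details. As a rough illustration only, the sketch below shows a generic weak-to-strong consistency loss of the kind common in semi-supervised segmentation: pixels where the weakly perturbed prediction is confident become pseudo-labels for the strongly perturbed prediction. This is not the RESMatch implementation; the function names, the 0.95 threshold, and the flat per-pixel lists are all assumptions made for this example.

```python
import math

def pseudo_labels(weak_probs, threshold=0.95):
    """Turn weak-view foreground probabilities into (labels, mask).

    The 0.95 threshold is an assumed value, not taken from the paper.
    """
    labels, mask = [], []
    for p in weak_probs:
        labels.append(1 if p >= 0.5 else 0)
        confidence = max(p, 1.0 - p)  # confidence of the binary decision
        mask.append(1 if confidence >= threshold else 0)
    return labels, mask

def masked_bce(strong_probs, labels, mask, eps=1e-7):
    """Binary cross-entropy on the strong view, averaged over confident pixels only."""
    total, kept = 0.0, 0
    for q, y, m in zip(strong_probs, labels, mask):
        if not m:
            continue  # low-confidence pseudo-labels contribute nothing
        q = min(max(q, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
        kept += 1
    return total / kept if kept else 0.0

# Hypothetical per-pixel foreground probabilities for one unlabeled image.
weak = [0.99, 0.60, 0.02, 0.40]    # prediction under weak perturbation
strong = [0.90, 0.50, 0.10, 0.50]  # prediction under strong perturbation

labels, mask = pseudo_labels(weak)
loss = masked_bce(strong, labels, mask)
print(labels, mask, round(loss, 4))
```

Only the first and third pixels pass the confidence threshold here, so the unsupervised loss is driven by those two alone; lowering the threshold admits more, but noisier, pseudo-labels.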

Related papers

ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions. ResAgent is a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). ResAgent implements a coarse-to
arXiv Detail & Related papers (2026-01-23T01:56:04Z)
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation [58.80001825332851]
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. Recent methods predominantly focus on simple expressions like "red car" or "left girl".
arXiv Detail & Related papers (2025-10-11T10:50:58Z)
Proxy-Embedding as an Adversarial Teacher: An Embedding-Guided Bidirectional Attack for Referring Expression Segmentation Models [7.064823891326925]
Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. We present PEAT, an Embedding-Guided Bidirectional Attack for RES models.
arXiv Detail & Related papers (2025-06-19T09:14:04Z)
Segment Concealed Objects with Incomplete Supervision [63.637733655439334]
Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments. This task remains highly challenging due to the limited supervision provided by the incompletely annotated training data. In this paper, we introduce the first unified method for ISCOS to address these challenges.
arXiv Detail & Related papers (2025-06-10T16:25:15Z)
SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data [4.962252439662465]
We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data.
arXiv Detail & Related papers (2025-05-23T10:05:16Z)
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation [11.205928115216]
We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES). Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is prompting attribute. We contribute a new benchmark dataset to offer 3K carefully curated RES instances to assess part-level, arbitrary RES solutions.
arXiv Detail & Related papers (2025-05-03T15:19:20Z)
CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass [3.0566617373924325]
Recent advances in pre-trained language models (PLMs) have driven remarkable progress in this field. We propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. We show that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption.
arXiv Detail & Related papers (2025-05-01T08:27:14Z)
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning [23.129998055266245]
Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information. We introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR).
arXiv Detail & Related papers (2025-04-16T10:09:45Z)
SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition [4.670023983240585]
We propose a novel SS-DFER framework that simultaneously incorporates semantic, instance, and text-level information to generate high-quality pseudo-labels. Our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines.
arXiv Detail & Related papers (2025-03-24T09:08:14Z)
Training Strategies for Isolated Sign Language Recognition [72.27323884094953]
This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. We achieve a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution.
arXiv Detail & Related papers (2024-12-16T08:37:58Z)
ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision. This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline. Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z)
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation [37.96005100341482]
Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios. Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification. We propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z)
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression Segmentation introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules. We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation [38.0788558329856]
We build the largest visual grounding dataset namely MRES-32M, which comprises over 32.2M high-quality masks and captions. Besides, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task.
arXiv Detail & Related papers (2023-12-13T09:29:45Z)
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object). We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval [54.66399120084227]
We propose a novel method to improve the generalization of dense retrieval via capturing matching signal called BERM. Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
arXiv Detail & Related papers (2023-05-18T15:43:09Z)
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches. This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality. Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.