Towards Robust Referring Image Segmentation
- URL: http://arxiv.org/abs/2209.09554v2
- Date: Sun, 23 Jul 2023 10:27:35 GMT
- Title: Towards Robust Referring Image Segmentation
- Authors: Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, Dacheng
Tao
- Abstract summary: Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions.
We propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS).
We create three R-RIS datasets by augmenting existing RIS datasets with negative sentences.
We propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module.
- Score: 80.53860642199412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress on RIS, including different fusion method designs. In this work, we explore an essential question: "What if the text description is wrong or misleading?" For example, the described objects may not be present in the image at all. We term such a sentence a negative sentence. However, existing RIS solutions cannot handle this setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences, and we propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design extends easily to the R-RIS setting by adding extra blank tokens. RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at https://github.com/jianzongwu/robust-ref-seg.
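As a concrete reading aid, here is a minimal PyTorch sketch of how a token-based vision-language fusion module could be extended with extra blank tokens so that a negative sentence yields an empty mask. The layer choices, dimensions, and the attention-based absence rule below are illustrative assumptions, not RefSegformer's actual design.

```python
import torch
import torch.nn as nn

class BlankTokenFusion(nn.Module):
    """Illustrative token-based vision-language fusion with extra blank tokens.

    Hypothetical sketch: if cross-attention mass concentrates on the blank
    tokens, the referred object is judged absent and the mask is suppressed.
    """

    def __init__(self, dim=256, num_blank=1, num_heads=8):
        super().__init__()
        self.blank_tokens = nn.Parameter(torch.randn(num_blank, dim))
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, vis_feats, lang_tokens):
        # vis_feats: (B, C, H, W) visual features; lang_tokens: (B, L, C) text tokens
        B, C, H, W = vis_feats.shape
        vis_seq = vis_feats.flatten(2).transpose(1, 2)              # (B, HW, C)
        blanks = self.blank_tokens.unsqueeze(0).expand(B, -1, -1)   # (B, K, C)
        keys = torch.cat([lang_tokens, blanks], dim=1)              # text + blank tokens
        fused, attn = self.fusion(vis_seq, keys, keys)              # cross-attend vision -> text/blank
        # Fraction of attention falling on blank tokens, averaged over pixels.
        blank_score = attn[..., lang_tokens.size(1):].sum(-1).mean(-1)  # (B,)
        masks = self.mask_head(fused.transpose(1, 2).reshape(B, C, H, W))
        # For a negative sentence the blank score should dominate; output an empty mask.
        masks = torch.where(blank_score.view(B, 1, 1, 1) > 0.5,
                            torch.zeros_like(masks), masks)
        return masks, blank_score
```

A real model would supervise the absence decision with the negative-sentence data; this sketch only illustrates the data flow.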
Related papers
- iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
arXiv Detail & Related papers (2024-05-05T14:39:06Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database given a reference image and a relative caption.
Recent research sidesteps the need for task-specific training by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL); a rough retrieval sketch follows the arXiv link below.
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
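The CIReVL entry above is training-free; as a rough illustration (not the paper's exact pipeline), the sketch below ranks a gallery with an off-the-shelf CLIP model given a composed text query, assuming that query has already been produced by captioning the reference image and merging in the relative caption. The checkpoint name and helper signature are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with matching text/image towers would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def rank_gallery(composed_query: str, gallery_paths: list[str]) -> list[tuple[str, float]]:
    """Rank gallery images by CLIP similarity to the composed text query."""
    text_inputs = processor(text=[composed_query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    images = [Image.open(p).convert("RGB") for p in gallery_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    scores = (image_emb @ text_emb.T).squeeze(-1)   # cosine similarity per gallery image
    order = scores.argsort(descending=True)
    return [(gallery_paths[i], scores[i].item()) for i in order.tolist()]

# Example composed query: a generated caption of the reference image followed by
# the relative caption, e.g. "a man in a blue jacket, but wearing it in red".
```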
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts; a minimal captioning sketch follows the arXiv link below.
Our proposed method performs favorably against state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
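The sentence-level prompt entry above names BLIP-2 as the prompt generator; below is a minimal, hedged sketch of producing such a prompt with the Hugging Face transformers BLIP-2 API. The checkpoint and the way the caption is merged with the relative caption are assumptions, not the paper's exact recipe.

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint; the paper's exact BLIP-2 variant may differ.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

@torch.no_grad()
def sentence_level_prompt(reference_image, relative_caption: str) -> str:
    """Caption the reference image, then splice in the requested modification."""
    inputs = processor(images=reference_image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True).strip()
    # One simple (assumed) way to form a sentence-level prompt for retrieval.
    return f"{caption}, but {relative_caption}"
```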
- Towards Complex-query Referring Image Segmentation: A Novel Benchmark [42.263084522244796]
We propose a new RIS benchmark with complex queries, namely RIS-CQ.
The RIS-CQ dataset is of high quality and large scale, and it challenges existing RIS methods with enriched, specific, and informative queries.
We present a niche-targeting method to better tackle RIS-CQ, called the dual-modality graph alignment model (DuMoGa).
arXiv Detail & Related papers (2023-09-29T12:58:13Z)
- Referring Image Segmentation Using Text Supervision [44.27304699305985]
Existing Referring Image Segmentation (RIS) methods typically require expensive pixel-level or box-level annotations for supervision.
We propose a novel weakly-supervised RIS framework to formulate the target localization problem as a classification process.
Our framework achieves performance comparable to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas.
arXiv Detail & Related papers (2023-08-28T13:40:47Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe the intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training; a rough sketch of the pseudo-word-token idea follows the arXiv link below.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)
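The Pic2Word entry above maps a picture to a word-like token for zero-shot CIR; the sketch below illustrates that general idea with a small projection network that turns a frozen image embedding into a pseudo word-token embedding. The network shape, template, and training objective mentioned in the comments are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class Img2PseudoWord(nn.Module):
    """Illustrative mapper from an image embedding to a pseudo word-token embedding."""

    def __init__(self, img_dim=512, token_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim, img_dim), nn.ReLU(), nn.Linear(img_dim, token_dim)
        )

    def forward(self, image_embedding):      # (B, img_dim) from a frozen image encoder
        return self.proj(image_embedding)    # (B, token_dim) pseudo word-token embedding

# The pseudo token would replace a placeholder such as "*" in a template like
# "a photo of * that <relative caption>", whose token embeddings then go through
# the frozen text encoder; only the mapper is trained (e.g., with a contrastive
# objective on unlabeled images), so no labeled CIR triplets are required.
mapper = Img2PseudoWord()
pseudo_token = mapper(torch.randn(1, 512))
print(pseudo_token.shape)   # torch.Size([1, 512])
```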
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task via a novel Bottom-up Cross-modal Semantic Composition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus [42.14174599341824]
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Most existing R-VOS methods have a critical assumption: the object referred to must appear in the video.
In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches.
arXiv Detail & Related papers (2022-07-04T05:08:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.