Advancing Referring Expression Segmentation Beyond Single Image
- URL: http://arxiv.org/abs/2305.12452v1
- Date: Sun, 21 May 2023 13:14:28 GMT
- Title: Advancing Referring Expression Segmentation Beyond Single Image
- Authors: Yixuan Wu, Zhao Zhang, Chi Xie, Feng Zhu, Rui Zhao
- Abstract summary: We propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES).
GRES expands to a collection of related images, allowing the described objects to be present in a subset of input images.
We introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of target objects described by given expressions.
- Score: 12.234097959235417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Segmentation (RES) is a widely explored multi-modal
task, which endeavors to segment the pre-existing object within a single image
with a given linguistic expression. However, in broader real-world scenarios,
it is not always possible to determine if the described object exists in a
specific image. Typically, we have a collection of images, some of which may
contain the described objects. The current RES setting curbs its practicality
in such situations. To overcome this limitation, we propose a more realistic
and general setting, named Group-wise Referring Expression Segmentation (GRES),
which expands RES to a collection of related images, allowing the described
objects to be present in a subset of input images. To support this new setting,
we introduce an elaborately compiled dataset named Grouped Referring Dataset
(GRD), containing complete group-wise annotations of target objects described
by given expressions. We also present a baseline method named Grouped Referring
Segmenter (GRSer), which explicitly captures the language-vision and
intra-group vision-vision interactions to achieve state-of-the-art results on
the proposed GRES and related tasks, such as Co-Salient Object Detection and
RES. Our dataset and code will be publicly released at
https://github.com/yixuan730/group-res.
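To make the setting concrete, here is a minimal sketch of a GRES-style interface (the function name and the trivial all-zero model are illustrative assumptions, not the authors' GRSer implementation): the input is a group of related images plus one expression, and the output is one mask per image, all-zero when the described object is absent from that image.

```python
# Hypothetical GRES-style interface; not the authors' GRSer code.
import numpy as np

def segment_group(images: list[np.ndarray], expression: str) -> list[np.ndarray]:
    """Return one binary mask per image; all-zero if the target is absent."""
    # Toy stand-in for a real model: predict "absent" everywhere.
    return [np.zeros(img.shape[:2], dtype=np.uint8) for img in images]

group = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
masks = segment_group(group, "the red backpack on the bench")
assert len(masks) == len(group)  # a mask is produced for every image in the group
```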
Related papers
- GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Current solutions to GRES remain unsatisfactory, since segmentation MLLMs cannot correctly handle cases where users reference multiple subjects in a single prompt.
We propose Generalized Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
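As a rough, single-head illustration of the value-value / self-self attention idea (shapes and the projection setup are assumed; the actual GEM module differs in detail), attention weights are computed between one projection and itself rather than between queries and keys:

```python
# Single-head "self-self" (here value-value) attention; illustrative only.
import torch
import torch.nn.functional as F

def value_value_attention(x: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    v = x @ w_v                              # value projection: (tokens, dim)
    scores = v @ v.t() / v.shape[-1] ** 0.5  # v-v similarity instead of q-k
    return F.softmax(scores, dim=-1) @ v     # aggregate values by self-similarity

tokens = torch.randn(16, 64)
out = value_value_attention(tokens, torch.randn(64, 64))
print(out.shape)  # torch.Size([16, 64])
```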
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
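A hedged sketch of the text-to-image direction (names and shapes are assumed, not the DMMI code): the sentence embedding acts as the attention query over flattened visual features, and the resulting attention map gives a coarse localization of the target.

```python
# Text embedding queries visual features; illustrative, not the DMMI network.
import torch
import torch.nn.functional as F

def text_to_image_query(text_emb: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
    # text_emb: (1, dim); vis: (h*w, dim) flattened visual features
    scores = text_emb @ vis.t() / vis.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1)  # (1, h*w) attention over spatial positions

attn = text_to_image_query(torch.randn(1, 64), torch.randn(20 * 20, 64))
heatmap = attn.reshape(20, 20)  # coarse spatial localization of the target
```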
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- RRSIS: Referring Remote Sensing Image Segmentation [25.538406069768662]
Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to fill this gap and make some insightful explorations.
arXiv Detail & Related papers (2023-06-14T16:40:19Z)
- Referring Camouflaged Object Detection [97.90911862979355]
Referring camouflaged object detection (Ref-COD) aims to segment specified camouflaged objects based on a small set of referring images with salient target objects.
We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios.
arXiv Detail & Related papers (2023-06-13T04:15:37Z)
- GRES: Generalized Referring Expression Segmentation [32.12725360752345]
We introduce a new benchmark called Generalized Referring Expression Segmentation (GRES).
GRES allows expressions to refer to an arbitrary number of target objects.
We construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions.
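As an illustration of what the no-target case implies for a prediction (an assumed consistency rule, not gRefCOCO's official metric): a no-target expression should yield an empty mask, while single- and multi-target expressions should not.

```python
# Illustrative consistency check for no-target expressions; not the benchmark's metric.
import numpy as np

def prediction_consistent(pred_mask: np.ndarray, num_targets: int) -> bool:
    """A no-target expression (num_targets == 0) should produce an empty mask."""
    return bool(pred_mask.any()) == (num_targets > 0)

empty = np.zeros((10, 10), dtype=np.uint8)
print(prediction_consistent(empty, 0))  # True: empty mask matches a no-target case
```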
arXiv Detail & Related papers (2023-06-01T17:57:32Z)
- Vision-Language Transformer and Query Generation for Referring Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
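A minimal sketch of the "queries the image with the expression" idea using off-the-shelf cross-attention (token counts and dimensions are made up; this is not the paper's architecture):

```python
# Language tokens as queries over visual tokens; illustrative shapes only.
import torch

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
lang = torch.randn(1, 8, 64)    # (batch, language tokens, dim)
vis = torch.randn(1, 400, 64)   # (batch, flattened image tokens, dim)
queried, weights = attn(query=lang, key=vis, value=vis)
print(queried.shape)  # torch.Size([1, 8, 64]): language tokens enriched by vision
```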
arXiv Detail & Related papers (2021-08-12T07:24:35Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
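The two-stage scheme can be sketched as follows (both stages are toy placeholders, not the paper's modules): first predict a coarse region for the referred object, then refine that region into a mask.

```python
# Toy locate-then-segment pipeline; both stages are placeholders.
import numpy as np

def locate(image: np.ndarray, expression: str) -> tuple[int, int, int, int]:
    h, w = image.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)  # toy coarse box prior

def segment(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1  # toy refinement: fill the located region
    return mask

img = np.zeros((240, 320, 3), dtype=np.uint8)
mask = segment(img, locate(img, "the dog on the left"))
print(mask.sum())  # number of pixels assigned to the referred object
```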
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state of the art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
- Context-Aware Group Captioning via Self-Attention and Contrastive Features [31.94715153491951]
We introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images.
To solve this problem, we propose a framework combining self-attention mechanism with contrastive feature construction.
Our datasets are constructed on top of the public Conceptual Captions dataset and our new Stock Captions dataset.
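One plausible reading of "contrastive feature construction" (my interpretation for illustration only, not the authors' exact formulation): pool features within each group and contrast them, so that content shared with the reference group cancels out and what is distinctive about the target group remains.

```python
# Contrast target-group features against reference-group features; illustrative.
import torch

def contrastive_feature(target_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    # target_feats: (n_target, dim); ref_feats: (n_ref, dim)
    return target_feats.mean(dim=0) - ref_feats.mean(dim=0)  # what sets the targets apart

f = contrastive_feature(torch.randn(5, 128), torch.randn(20, 128))
print(f.shape)  # torch.Size([128])
```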
arXiv Detail & Related papers (2020-04-07T20:59:53Z)