Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
- URL: http://arxiv.org/abs/2501.09688v1
- Date: Thu, 16 Jan 2025 17:40:19 GMT
- Title: Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
- Authors: Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim,
- Abstract summary: Open-Vocabulary Part (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories.
We identify two primary challenges in OVPS: the difficulty in aligning part-level image-text correspondence, and the lack of structural understanding in segmenting object parts.
We propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO.
- Score: 24.071471822239854
- License:
- Abstract: Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
Related papers
- A Bottom-Up Approach to Class-Agnostic Image Segmentation [4.086366531569003]
We present a novel bottom-up formulation for addressing the class-agnostic segmentation problem.
We supervise our network directly on the projective sphere of its feature space.
Our bottom-up formulation exhibits exceptional generalization capability, even when trained on datasets designed for class-based segmentation.
arXiv Detail & Related papers (2024-09-20T17:56:02Z) - Understanding Multi-Granularity for Open-Vocabulary Part Segmentation [24.071471822239854]
Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities using diverse and previously unseen vocabularies.
Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification.
We propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts.
arXiv Detail & Related papers (2024-06-17T10:11:28Z) - USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation [33.11010205890195]
The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories.
We introduce the Universal Segment Embedding (USE) framework to address this challenge.
This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories.
arXiv Detail & Related papers (2024-06-07T21:41:18Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Mitigating the Effect of Incidental Correlations on Part-based Learning [50.682498099720114]
Part-based representations could be more interpretable and generalize better with limited data.
We present two innovative regularization methods for part-based representations.
We exhibit state-of-the-art (SoTA) performance on few-shot learning tasks on benchmark datasets.
arXiv Detail & Related papers (2023-09-30T13:44:48Z) - CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z) - Betrayed by Captions: Joint Caption Grounding and Generation for Open
Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint textbfCaption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that only focuses on matching object to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z) - Seg&Struct: The Interplay Between Part Segmentation and Structure
Inference for 3D Shape Parsing [23.8184215719129]
Seg&Struct is a supervised learning framework leveraging the interplay between part segmentation and structure inference.
We present how these two tasks can be best combined while fully utilizing supervision to improve performance.
arXiv Detail & Related papers (2022-11-01T10:59:15Z) - Open-world Semantic Segmentation via Contrasting and Clustering
Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - Self-Supervised Video Object Segmentation via Cutout Prediction and
Tagging [117.73967303377381]
We propose a novel self-supervised Video Object (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and Youtube-VOS.
arXiv Detail & Related papers (2022-04-22T17:53:27Z) - Affinity-aware Compression and Expansion Network for Human Parsing [6.993481561132318]
ACENet achieves new state-of-the-art performance on the challenging LIP and Pascal-Person-Part datasets.
58.1% mean IoU is achieved on the LIP benchmark.
arXiv Detail & Related papers (2020-08-24T05:16:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.