Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
- URL: http://arxiv.org/abs/2501.09688v1
- Date: Thu, 16 Jan 2025 17:40:19 GMT
- Title: Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
- Authors: Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim
- Abstract summary: Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: the difficulty in aligning part-level image-text correspondence, and the lack of structural understanding in segmenting object parts. We propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO.
- Score: 24.071471822239854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
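The disentangled cost aggregation described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the feature dimensions, the max-pooled object prior, and the multiplicative combination rule are all hypothetical, used only to show the idea of keeping object-level and part-level cost volumes separate before combining them:

```python
import numpy as np

def cost_volume(image_feats, text_feats):
    """Cosine-similarity cost volume between image patch features
    and text (class-name) embeddings."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return img @ txt.T  # shape: (num_patches, num_classes)

# Hypothetical shapes: 16 image patches, 512-dim features,
# 3 object classes, 5 part classes.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 512))
object_text = rng.normal(size=(3, 512))
part_text = rng.normal(size=(5, 512))

# Disentangled costs: object-level and part-level volumes are computed
# separately, then combined (here, a simple product with an object prior).
object_cost = cost_volume(patches, object_text)        # (16, 3)
part_cost = cost_volume(patches, part_text)            # (16, 5)
object_prior = object_cost.max(axis=1, keepdims=True)  # strongest object match per patch
combined = part_cost * object_prior                    # object-aware part cost, (16, 5)
print(combined.shape)
```

Keeping the two volumes separate means a weak part-text match cannot be confused with a weak object-text match, which is the intuition behind handling the two cost levels independently.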
Related papers
- Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework.
Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image.
This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z) - One-shot In-context Part Segmentation [97.77292483684877]
We present the One-shot In-context Part (OIParts) framework to tackle the challenges of part segmentation.
Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient.
We have achieved remarkable segmentation performance across diverse object categories.
arXiv Detail & Related papers (2025-03-03T03:50:54Z) - A Bottom-Up Approach to Class-Agnostic Image Segmentation [4.086366531569003]
We present a novel bottom-up formulation for addressing the class-agnostic segmentation problem.
We supervise our network directly on the projective sphere of its feature space.
Our bottom-up formulation exhibits exceptional generalization capability, even when trained on datasets designed for class-based segmentation.
arXiv Detail & Related papers (2024-09-20T17:56:02Z) - Understanding Multi-Granularity for Open-Vocabulary Part Segmentation [24.071471822239854]
Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities using diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. We propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts.
arXiv Detail & Related papers (2024-06-17T10:11:28Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Mitigating the Effect of Incidental Correlations on Part-based Learning [50.682498099720114]
Part-based representations could be more interpretable and generalize better with limited data.
We present two innovative regularization methods for part-based representations.
We exhibit state-of-the-art (SoTA) performance on few-shot learning tasks on benchmark datasets.
arXiv Detail & Related papers (2023-09-30T13:44:48Z) - Hierarchical Open-vocabulary Universal Image Segmentation [48.008887320870244]
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff".
Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
arXiv Detail & Related papers (2023-07-03T06:02:15Z) - CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z) - Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching objects to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z) - Seg&Struct: The Interplay Between Part Segmentation and Structure Inference for 3D Shape Parsing [23.8184215719129]
Seg&Struct is a supervised learning framework leveraging the interplay between part segmentation and structure inference.
We present how these two tasks can be best combined while fully utilizing supervision to improve performance.
arXiv Detail & Related papers (2022-11-01T10:59:15Z) - Self-Supervised Video Object Segmentation via Cutout Prediction and Tagging [117.73967303377381]
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and Youtube-VOS.
arXiv Detail & Related papers (2022-04-22T17:53:27Z) - Affinity-aware Compression and Expansion Network for Human Parsing [6.993481561132318]
ACENet achieves new state-of-the-art performance on the challenging LIP and Pascal-Person-Part datasets.
58.1% mean IoU is achieved on the LIP benchmark.
arXiv Detail & Related papers (2020-08-24T05:16:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.