PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus
- URL: http://arxiv.org/abs/2405.16094v2
- Date: Mon, 3 Jun 2024 08:27:09 GMT
- Title: PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus
- Authors: Zhaochen Liu, Limeng Qiao, Xiangxiang Chu, Tingting Jiang
- Abstract summary: We propose the first SAM-based amodal segmentation approach, PLUG.
At the region level, because visible and occluded areas are related yet distinct, the inmodal and amodal regions are assigned as the focuses of separate branches to avoid mutual interference.
At the point level, we introduce the concept of uncertainty to explicitly help the model identify and focus on ambiguous points.
- Score: 19.25678147515461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aiming to predict the complete shapes of partially occluded objects, amodal segmentation is an important step towards visual intelligence. Practical prior knowledge derives from sufficient training, but limited amodal annotations make strong performance difficult to achieve. To tackle this problem, we draw on the powerful priors accumulated in a foundation model and propose PLUG, the first SAM-based amodal segmentation approach. Methodologically, a novel framework with hierarchical focus is presented to better suit the task characteristics and unleash the potential capabilities of SAM. At the region level, because visible and occluded areas are related yet distinct, the inmodal and amodal regions are assigned as the focuses of separate branches to avoid mutual interference. At the point level, we introduce the concept of uncertainty to explicitly help the model identify and focus on ambiguous points. Guided by the uncertainty map, a computationally economical point loss is applied to improve the accuracy of predicted boundaries. Experiments are conducted on several prominent datasets, and the results show that our proposed method outperforms existing methods by large margins. Even with fewer total parameters, our method still exhibits remarkable advantages.
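The point-level mechanism reads like a PointRend-style selective supervision scheme. Below is a minimal PyTorch sketch of one way such an uncertainty-guided point loss could look; the function name, the top-k point selection, and the use of distance from probability 0.5 as the uncertainty measure are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def uncertainty_guided_point_loss(logits, target, num_points=256):
    """Illustrative uncertainty-guided point loss (assumed, not PLUG's exact form).

    logits: (B, 1, H, W) raw amodal mask predictions
    target: (B, 1, H, W) float binary ground-truth amodal masks
    """
    prob = torch.sigmoid(logits)
    # Uncertainty map: largest where the prediction is least decisive,
    # i.e. where the predicted probability is closest to 0.5.
    uncertainty = -(prob - 0.5).abs()               # (B, 1, H, W)
    flat_unc = uncertainty.flatten(1)               # (B, H*W)
    # Supervise only the top-k most ambiguous points, which tend to
    # concentrate along the predicted boundary.
    idx = flat_unc.topk(num_points, dim=1).indices  # (B, num_points)
    point_logits = logits.flatten(1).gather(1, idx)
    point_target = target.flatten(1).gather(1, idx)
    return F.binary_cross_entropy_with_logits(point_logits, point_target)
```

Because only a few hundred points enter the loss, the overhead relative to a dense mask loss is negligible, which is consistent with the abstract's description of the point loss as computationally economical.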
Related papers
- Unified modality separation: A vision-language framework for unsupervised domain adaptation [60.8391821117794]
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. We propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. Our methods achieve up to a 9% performance gain while being 9x more computationally efficient.
arXiv Detail & Related papers (2025-08-07T02:51:10Z)
- MIRRAMS: Towards Training Models Robust to Missingness Distribution Shifts [2.5357049657770516]
In real-world data analysis, missingness distribution shifts between training and test input datasets frequently occur. We propose a novel deep learning framework designed to address such shifts in missingness distributions. Our approach achieves state-of-the-art performance even without missing data and extends naturally to semi-supervised learning tasks.
arXiv Detail & Related papers (2025-07-11T03:03:30Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- How Far Are We from Predicting Missing Modalities with Foundation Models? [31.853781353441242]
Current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. Our framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
arXiv Detail & Related papers (2025-06-04T03:22:44Z)
- Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities [15.205192581534973]
Multimodal sentiment analysis aims to understand human sentiment through multimodal data. Existing methods for handling modality missingness are based on data reconstruction or common subspace projections. We propose a Confidence-Aware Self-Distillation (CASD) strategy that effectively incorporates multimodal probabilistic embeddings.
arXiv Detail & Related papers (2025-06-02T09:48:41Z)
- A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation [0.0]
Advancements in image segmentation play an integral role within the greater scope of Deep Learning-based computer vision.
Uncertainty quantification has been extensively studied within this context, enabling expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision making.
This work provides a comprehensive overview of probabilistic segmentation, discussing the fundamental concepts of uncertainty that govern advancements in the field and their application to various tasks.
arXiv Detail & Related papers (2024-11-25T13:26:09Z)
- Minimizing Embedding Distortion for Robust Out-of-Distribution Performance [1.0923877073891446]
We introduce a novel approach we call "similarity loss", which can be incorporated into the fine-tuning process of any task.
We evaluate our approach on two diverse tasks: image classification on satellite imagery and face recognition.
arXiv Detail & Related papers (2024-09-11T19:22:52Z)
- Diffusion Features to Bridge Domain Gap for Semantic Segmentation [2.8616666231199424]
This paper investigates an approach that leverages sampling and fusion techniques to efficiently harness the features of diffusion models.
By leveraging the strength of the text-to-image generation capability of diffusion models, we introduce a new training framework designed to implicitly learn posterior knowledge from them.
arXiv Detail & Related papers (2024-06-02T15:33:46Z)
- Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance [59.71186244597394]
We introduce an effective approach to stabilize the proposal-target matching in point-based methods.
We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization.
We also develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios.
arXiv Detail & Related papers (2024-05-17T07:23:27Z)
- A Generalization Theory of Cross-Modality Distillation with Contrastive Learning [49.35244441141323]
Cross-modality distillation is an important topic for data modalities that contain limited knowledge.
We formulate a general framework of cross-modality contrastive distillation (CMCD), built upon contrastive learning.
Our algorithm outperforms existing algorithms consistently by a margin of 2-3% across diverse modalities and tasks.
arXiv Detail & Related papers (2024-05-06T11:05:13Z)
- DiffusionNOCS: Managing Symmetry and Uncertainty in Sim2Real Multi-Modal Category-level Pose Estimation [20.676510832922016]
We propose a probabilistic model that relies on diffusion to estimate dense canonical maps crucial for recovering partial object shapes.
We introduce critical components to enhance performance by leveraging the strength of the diffusion models with multi-modal input representations.
Despite being trained solely on our generated synthetic data, our approach achieves state-of-the-art performance and unprecedented generalization qualities.
arXiv Detail & Related papers (2024-02-20T01:48:33Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
arXiv Detail & Related papers (2021-06-12T20:21:38Z)
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [84.49035467829819]
We show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
arXiv Detail & Related papers (2020-05-01T23:26:41Z)
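For context on the last entry, the generic Information Bottleneck objective that such rationale extractors bound can be written as follows; the notation (X input, Z extracted rationale, Y end-task label, beta trade-off weight) is the standard textbook form, not necessarily the paper's exact parameterization.

```latex
% Standard IB objective: keep the rationale Z maximally predictive of
% the label Y while compressing away everything else about the input X.
\[
  \min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
\]
```

Sparser binary masks over sentences lower I(X;Z), so tuning the trade-off weight directly controls the conciseness of the extracted rationale.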