Visual Consensus Prompting for Co-Salient Object Detection
- URL: http://arxiv.org/abs/2504.14254v1
- Date: Sat, 19 Apr 2025 10:12:39 GMT
- Title: Visual Consensus Prompting for Co-Salient Object Detection
- Authors: Jie Wang, Nana Yu, Zihao Zhang, Yahong Han
- Abstract summary: We propose an interaction-effective and parameter-efficient concise architecture for the co-salient object detection task. It introduces a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset).
- Score: 26.820772908765083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) the architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage; 2) the paradigm globally updates all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task that addresses both limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimal tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce the limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby unleashing the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset). Source code is available at https://github.com/WJ-CV/VCP.
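As a rough illustration of the pipeline described in the abstract, the snippet below shows how a Consensus Prompt Generator might pool group-level features into a handful of consensus prompt tokens, and how a Consensus Prompt Disperser might inject them back into the token stream of a frozen encoder. This is a minimal sketch under assumed module and tensor names, not the authors' implementation (which is released at https://github.com/WJ-CV/VCP).

```python
# Rough sketch of the VCP idea: tunable parameters are confined to the
# prompt-generation path while the foundation encoder stays frozen.
# All module and tensor names below are hypothetical simplifications,
# not the authors' implementation (see https://github.com/WJ-CV/VCP).
import torch
import torch.nn as nn


class ConsensusPromptGenerator(nn.Module):
    """CPG sketch: pool a group of image features into a shared consensus
    and map it to a small set of prompt tokens (few tunable parameters)."""

    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.num_prompts, self.dim = num_prompts, dim
        self.to_prompts = nn.Linear(dim, num_prompts * dim)

    def forward(self, group_feats: torch.Tensor) -> torch.Tensor:
        # group_feats: (N, L, D) encoder tokens from the N images of one group
        consensus = group_feats.mean(dim=(0, 1))            # (D,) group-level consensus
        return self.to_prompts(consensus).view(self.num_prompts, self.dim)


class ConsensusPromptDisperser(nn.Module):
    """CPD sketch: prepend the consensus prompts to every image's token
    sequence so the frozen encoder can attend to them at the next stage."""

    def forward(self, feats: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # feats: (N, L, D), prompts: (P, D) -> (N, P + L, D)
        n = feats.size(0)
        return torch.cat([prompts.unsqueeze(0).expand(n, -1, -1), feats], dim=1)


# Usage: only CPG parameters are trained; the foundation encoder is frozen.
dim, group_size = 384, 5
cpg, cpd = ConsensusPromptGenerator(dim), ConsensusPromptDisperser()
feats = torch.randn(group_size, 196, dim)   # tokens for one image group
prompted_tokens = cpd(feats, cpg(feats))    # would feed the next frozen encoder block
```

The point the sketch tries to convey is only that the trainable parameters live in the prompt path, so the pre-trained encoder weights remain untouched; the actual interaction between prompts and encoder stages follows the paper.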
Related papers
- UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation [9.275683880295874]
Scene Graph Generation (SGG) aims at identifying object entities and reasoning their relationships within a given image.
One-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets.
The challenge in one-stage methods stems from the issue of weak entanglement.
We introduce UniQ, a unified decoder architecture with task-specific queries.
arXiv Detail & Related papers (2025-01-10T03:38:16Z) - PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [51.88536367177796]
We propose a two-stage, edit-based approach inspired by human drafts for automatically generating presentations. PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides. PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
arXiv Detail & Related papers (2025-01-07T16:53:01Z) - CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection [6.017022924697519]
We propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios. Specifically, we design an efficient prompt-visual hybrid encoder that enhances the information interaction between prompts and visual features. In addition to text prompts, we design two practical concept prompt generation methods: visual prompt and optimized prompt.
arXiv Detail & Related papers (2024-12-13T02:36:29Z) - Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding [14.175444025026508]
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting.
However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference.
We propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
arXiv Detail & Related papers (2024-09-13T06:29:20Z) - EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration [63.112790050749695]
We introduce EAGER, a novel generative recommendation framework that seamlessly integrates both behavioral and semantic information.
We validate the effectiveness of EAGER on four public benchmarks, demonstrating its superior performance compared to existing methods.
arXiv Detail & Related papers (2024-06-20T06:21:56Z) - Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance [62.15866177242207]
We show that, by constructing a subject-agnostic condition, one can obtain outputs consistent with both the given subject and the input text prompts.
Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements.
arXiv Detail & Related papers (2024-05-02T15:03:41Z) - PromptSum: Parameter-Efficient Controllable Abstractive Summarization [4.145362426026615]
We introduce PromptSum, a method combining PT with a multi-task objective and discrete entity prompts for abstractive summarization.
Our model achieves competitive ROUGE results on popular abstractive summarization benchmarks, coupled with a strong level of controllability through entities.
arXiv Detail & Related papers (2023-08-06T13:54:14Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Consistency-guided Prompt Learning for Vision-Language Models [23.4909421082857]
We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models.
Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting (a minimal sketch of this consistency-regularized prompt-tuning recipe appears after this related-papers list).
arXiv Detail & Related papers (2023-06-01T23:20:47Z) - Prompt-Matched Semantic Segmentation [96.99924127527002]
The objective of this work is to explore how to effectively adapt pre-trained foundation models to various downstream tasks of image semantic segmentation.
We propose a novel Inter-Stage Prompt-Matched Framework, which maintains the original structure of the foundation model while generating visual prompts adaptively for task-oriented tuning.
A lightweight module termed Semantic-aware Prompt Matcher is then introduced to hierarchically interpolate between two stages to learn reasonable prompts for each specific task.
arXiv Detail & Related papers (2022-08-22T09:12:53Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
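Several of the prompt-tuning entries above (e.g., PromptSRC and CoPrompt) share a common recipe: only prompt parameters are updated, and a regularization term keeps the prompted representations close to those of the original frozen model so that general knowledge is not forgotten. The snippet below is a minimal sketch of that objective under hypothetical model interfaces and loss weighting; it is not the exact formulation of either paper.

```python
# Minimal sketch of consistency-regularized prompt tuning in the spirit of
# PromptSRC / CoPrompt. The interfaces (model(images, prompts=...), model.head)
# and the weighting scheme are hypothetical placeholders.
import torch
import torch.nn.functional as F


def prompt_tuning_step(model, frozen_model, prompts, images, labels, lam=1.0):
    """One training step: task loss on the prompted model plus a consistency
    term pulling prompted features toward the frozen pre-trained features."""
    with torch.no_grad():
        anchor_feats = frozen_model(images)        # task-agnostic reference features
    feats = model(images, prompts=prompts)         # same backbone with prompts attached
    task_loss = F.cross_entropy(model.head(feats), labels)
    consistency = F.l1_loss(feats, anchor_feats)   # discourage forgetting general features
    return task_loss + lam * consistency           # only the prompts receive gradients in practice
```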