DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios
- URL: http://arxiv.org/abs/2410.09788v1
- Date: Sun, 13 Oct 2024 10:02:58 GMT
- Title: DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios
- Authors: Siyi Jiao, Wenzheng Zeng, Changxin Gao, Nong Sang
- Abstract summary: We propose DFIMat, a decoupled framework that enables flexible interactive matting.
Specifically, we first decouple the task into two sub-tasks: localizing target instances by understanding scene semantics and the flexible user inputs, and performing instance-level matting refinement.
We observe a clear performance gain from decoupling, as it makes sub-tasks easier to learn, and the flexible multi-type input further enhances both effectiveness and efficiency.
- Score: 32.77825044757212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive portrait matting extracts the soft portrait that best matches the user's intent, as expressed through their inputs, from a given image. Existing methods often underperform in complex scenarios, mainly for three reasons. (1) Most works apply a tightly coupled network that directly predicts matting results, which lacks interpretability and leads to inadequate modeling. (2) Existing works accept only a single type of user input, which is ineffective for understanding intent and inefficient for the user to operate. (3) The multi-round nature of interaction, crucial for user experience, has been under-explored. To alleviate these limitations, we propose DFIMat, a decoupled framework that enables flexible interactive matting. Specifically, we first decouple the task into two sub-tasks: localizing target instances by understanding scene semantics together with the flexible user inputs, and refining each localized instance into an instance-level matte. We observe a clear performance gain from decoupling, as it makes each sub-task easier to learn, and the flexible multi-type input further improves both effectiveness and efficiency. DFIMat also accounts for the multi-round nature of interaction: a contrastive reasoning module is designed to enhance cross-round refinement. Another limitation of multi-person matting is the lack of training data. We address this by introducing a new synthetic data generation pipeline that produces far more realistic samples than prior art, and we subsequently establish a new large-scale dataset, SMPMat. Experiments verify the significant superiority of DFIMat. With it, we also investigate the roles of different input types, providing valuable guidance for users. Our code and dataset can be found at https://github.com/JiaoSiyi/DFIMat.
Related papers
- LLM-assisted Explicit and Implicit Multi-interest Learning Framework for Sequential Recommendation [50.98046887582194]
We propose an explicit and implicit multi-interest learning framework to model user interests on two levels: behavior and semantics.
The proposed EIMF framework effectively and efficiently combines small models with an LLM to improve the accuracy of multi-interest modeling.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Compressed Interaction Graph based Framework for Multi-behavior Recommendation [46.16750419508853]
It is challenging to explore multi-behavior data due to the unbalanced data distribution and sparse target behavior.
We propose CIGF, a Compressed Interaction Graph based Framework, to overcome the above limitations.
We propose a Multi-Expert with Separate Input (MESI) network on top of CIGCN for multi-task learning.
arXiv Detail & Related papers (2023-03-04T13:41:36Z)
- Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis [0.0]
We show that the choice of technique for building the multimodal representation is crucial to obtaining the highest possible model performance.
Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M.
arXiv Detail & Related papers (2022-06-09T21:30:10Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [0.0]
We propose a flexible model for the task which exploits all available data.
The task involves complex relations; to avoid using a large model specifically for video processing, we propose the use of behaviour encoding.
arXiv Detail & Related papers (2021-12-22T19:14:55Z)
- Disentangled Graph Collaborative Filtering [100.26835145396782]
Disentangled Graph Collaborative Filtering (DGCF) is a new model for learning informative representations of users and items from interaction data.
By modeling a distribution over intents for each user-item interaction, we iteratively refine the intent-aware interaction graphs and representations.
DGCF achieves significant improvements over several state-of-the-art models like NGCF, DisenGCN, and MacridVAE.
arXiv Detail & Related papers (2020-07-03T15:37:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.