Language-free Compositional Action Generation via Decoupling Refinement
- URL: http://arxiv.org/abs/2307.03538v3
- Date: Mon, 8 Jan 2024 14:54:49 GMT
- Title: Language-free Compositional Action Generation via Decoupling Refinement
- Authors: Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Xiao-Ping Zhang,
Ser-Nam Lim
- Abstract summary: We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
- Score: 67.50452446686725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composing simple elements into complex concepts is crucial yet challenging,
especially for 3D action generation. Existing methods largely rely on extensive
natural language annotations to discern composable latent semantics, a process
that is often costly and labor-intensive. In this study, we introduce a novel
framework to generate compositional actions without reliance on language
auxiliaries. Our approach consists of three main components: Action Coupling,
Conditional Action Generation, and Decoupling Refinement. Action Coupling
utilizes an energy model to extract the attention masks of each sub-action,
subsequently integrating two actions using these attention masks to generate
pseudo-training examples. Then, we employ a conditional generative model, a CVAE,
to learn a latent space that facilitates diverse generation. Finally, we
propose Decoupling Refinement, which leverages a self-supervised pre-trained
model, MAE (masked autoencoder), to ensure semantic consistency between the sub-actions and
compositional actions. This refinement process involves rendering generated 3D
actions into 2D space, decoupling these images into two sub-segments, using the
MAE model to restore the complete image from sub-segments, and constraining the
recovered images to match images rendered from raw sub-actions. Due to the lack
of existing datasets containing both sub-actions and compositional actions, we
created two new datasets, named HumanAct-C and UESTC-C, and present a
corresponding evaluation metric. Both qualitative and quantitative assessments
are conducted to demonstrate the efficacy of our approach.
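The generation step described above can be pictured with a minimal conditional VAE. The sketch below is not the authors' code: the action dimensionality (here a flattened pose sequence), the conditioning vector (a generic embedding of the two sub-actions), and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of a conditional VAE for compositional action generation.
# NOT the paper's implementation: action_dim, cond_dim, and the MLP
# architecture are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionCVAE(nn.Module):
    def __init__(self, action_dim=4320, cond_dim=32, latent_dim=64, hidden=512):
        super().__init__()
        # Encoder q(z | x, c): action + condition -> latent mean and log-variance.
        self.enc = nn.Sequential(
            nn.Linear(action_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # Decoder p(x | z, c): latent + condition -> reconstructed action.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, x, c):
        mu, logvar = self.enc(torch.cat([x, c], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_hat = self.dec(torch.cat([z, c], -1))
        recon = F.mse_loss(x_hat, x)  # reconstruction term
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
        return x_hat, recon + kl

    @torch.no_grad()
    def sample(self, c):
        # Diversity comes from sampling z in the learned latent space.
        z = torch.randn(c.shape[0], self.latent_dim, device=c.device)
        return self.dec(torch.cat([z, c], -1))
```

At inference, calling `sample` with the condition for a pair of sub-actions draws different latents and thus different composed actions for the same pair.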
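The Decoupling Refinement step can likewise be read as a consistency loss. The rough sketch below rests on stated assumptions: `render_2d`, `split_masks`, and `mae` are hypothetical placeholders for a differentiable renderer, the attention-based decoupling, and the frozen pre-trained MAE, none of which are specified in the abstract.

```python
# Sketch of the Decoupling Refinement consistency objective; the three
# callables are placeholders, not components the paper exposes.
import torch
import torch.nn.functional as F

def decoupling_refinement_loss(gen_action, sub_a, sub_b,
                               render_2d, split_masks, mae):
    """render_2d:   3D action -> image batch (B, C, H, W); placeholder for a
                    differentiable renderer.
       split_masks: image -> (mask_a, mask_b); placeholder for decoupling the
                    rendering into two sub-action segments.
       mae:         masked image -> restored image; placeholder for the
                    frozen, self-supervised pre-trained MAE.
    """
    img = render_2d(gen_action)        # render the generated 3D action to 2D
    mask_a, mask_b = split_masks(img)  # decouple the image into two sub-segments
    restored_a = mae(img * mask_a)     # restore the full image from each segment
    restored_b = mae(img * mask_b)
    # Constrain the restorations to match renders of the raw sub-actions.
    return (F.mse_loss(restored_a, render_2d(sub_a)) +
            F.mse_loss(restored_b, render_2d(sub_b)))
```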
Related papers
- BIFRÖST: 3D-Aware Image compositing with Language Instructions [27.484947109237964]
Bifr"ost is a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition.
Bifr"ost addresses issues by training MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
arXiv Detail & Related papers (2024-10-24T18:35:12Z)
- Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective [26.479602180023125]
The Linear Complexity Sequence Model (LCSM) unites various sequence modeling techniques with linear complexity.
We segment the modeling processes of these models into three distinct stages: Expand, Oscillation, and Shrink.
We perform experiments to analyze the impact of different stage settings on language modeling and retrieval tasks.
arXiv Detail & Related papers (2024-05-27T17:38:55Z)
- Part123: Part-aware 3D Reconstruction from a Single-view Image [54.589723979757515]
Part123 is a novel framework for part-aware 3D reconstruction from a single-view image.
We introduce contrastive learning into a neural rendering framework to learn a part-aware feature space.
A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models.
arXiv Detail & Related papers (2024-05-27T07:10:21Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Zero-Shot Image Harmonization with Generative Model Prior [22.984119094424056]
We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images.
We introduce a fully modularized framework inspired by human behavior.
We present compelling visual results across diverse scenes and objects, along with a user study validating our approach.
arXiv Detail & Related papers (2023-07-17T00:56:21Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which shares the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning a transferable representation from base classes to novel classes.
We propose a part-aware prototypical representation for one-shot skeleton-based action recognition.
We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
- IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion-based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
arXiv Detail & Related papers (2021-04-13T02:00:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.