Related papers: Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation

URL: http://arxiv.org/abs/2211.15603v2
Date: Wed, 30 Nov 2022 13:13:29 GMT
Title: Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation
Authors: Sai Shashank Kalakonda, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
Abstract summary: Action-GPT is a framework for incorporating Large Language Models into text-based action generation models. We show that utilizing detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces.
Score: 8.753131760384964
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Action-GPT, a plug and play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. Our experiments show qualitative and quantitative improvement in the quality of synthesized motions produced by recent text-to-motion models. Code, pretrained models and sample videos will be made available at https://actiongpt.github.io

Related papers

Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature. The recent success of large language models (LLMs) showcases the power of decoder-only transformers. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z)
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models [59.10171699717122]
MoTrans is a customized motion transfer method enabling video generation of similar motion in new contexts. Multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance. Our method effectively learns specific motion patterns from single or multiple reference videos.
arXiv Detail & Related papers (2024-12-02T10:07:59Z)
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLM). It supports multimodal control conditions through pre-trained Large Language Models (LLMs). It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z)
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion [8.94802080815133]
MoRAG is a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. We create diverse samples through the spatial composition of the retrieved motions. Our framework can serve as a plug-and-play module, improving the performance of motion diffusion models.
arXiv Detail & Related papers (2024-09-18T17:03:30Z)
MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text.
arXiv Detail & Related papers (2024-05-27T09:57:51Z)
Aligning Actions and Walking to LLM-Generated Textual Descriptions [3.1049440318608568]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns.
arXiv Detail & Related papers (2024-04-18T13:56:03Z)
CoMo: Controllable Motion Generation through Language Guided Pose Code Editing [57.882299081820626]
We introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions. CoMo decomposes motions into discrete and semantically meaningful pose codes. It autoregressively generates sequences of pose codes, which are then decoded into 3D motions.
arXiv Detail & Related papers (2024-03-20T18:11:10Z)
Motion Generation from Fine-grained Textual Descriptions [29.033358642532722]
We build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D. We design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines.
arXiv Detail & Related papers (2024-03-20T11:38:30Z)
OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers [45.808597624491156]
We present OMG, a novel framework, which enables compelling motion generation from zero-shot open-vocabulary text prompts. At the pre-training stage, our model improves the generation ability by learning the rich out-of-domain inherent motion traits. At the fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information.
arXiv Detail & Related papers (2023-12-14T14:31:40Z)
Real-time Animation Generation and Control on Rigged Models via Large Language Models [50.034712575541434]
We introduce a novel method for real-time animation control and generation on rigged models using natural language input. We embed a large language model (LLM) in Unity to output structured texts that can be parsed into diverse and realistic animations.
arXiv Detail & Related papers (2023-10-27T01:36:35Z)
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions. We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
Compositional Video Synthesis with Action Graphs [112.94651460161992]
Videos of actions are complex signals containing rich compositional structure in space and time. We propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates timely and coordinated video generation.
arXiv Detail & Related papers (2020-06-27T09:39:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.