Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation
- URL: http://arxiv.org/abs/2412.11193v1
- Date: Sun, 15 Dec 2024 13:58:37 GMT
- Title: Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation
- Authors: Ling-An Zeng, Guohong Huang, Gaojie Wu, Wei-Shi Zheng
- Abstract summary: Text-to-motion (T2M) generation plays a significant role in various applications.
Current methods involve a large number of parameters and suffer from slow inference speeds.
We propose a lightweight and fast model named Light-T2M to reduce usage costs.
- Score: 30.05431858162078
- License:
- Abstract: Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we aim to design a lightweight model to reduce usage costs. First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module. Second, we introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we design a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing the parameter count. Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation. By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M. Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on the HumanML3D dataset and 0.161 (vs. 0.228) on the KIT-ML dataset. The code is available at https://github.com/qinghuannn/light-t2m.
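The abstract names a Pseudo-bidirectional Scan built on Mamba as the key parameter-saving design. Below is a minimal, hedged sketch of one plausible reading of that idea: a single causal scan module is applied to the sequence and to its reversed copy with shared weights, so bidirectional context is captured without a second parameter set. The class name, the GRU stand-in for a Mamba block, and the averaging fusion are illustrative assumptions, not details from the paper.

```python
# Sketch of a pseudo-bidirectional scan: reuse ONE causal scan module over the
# forward and the reversed sequence so both directions are modeled without a
# second parameter set. The GRU stand-in and the averaging fusion are
# illustrative assumptions, not the paper's actual design.
import torch
import torch.nn as nn


class PseudoBidirectionalScan(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for a Mamba/SSM block; any causal sequence module fits here.
        self.scan = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        fwd, _ = self.scan(x)                          # left-to-right pass
        bwd, _ = self.scan(torch.flip(x, dims=[1]))    # same weights, reversed input
        bwd = torch.flip(bwd, dims=[1])                # realign to original order
        return 0.5 * (fwd + bwd)                       # simple fusion (assumption)


if __name__ == "__main__":
    motion = torch.randn(2, 196, 64)  # (batch, frames, latent dim)
    out = PseudoBidirectionalScan(64)(motion)
    print(out.shape)  # torch.Size([2, 196, 64])
```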
Related papers
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training [77.681908636429]
Text-to-image (T2I) models face several limitations on mobile devices, including large model sizes, slow inference, and low-quality generation.
This paper aims to develop an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms.
arXiv Detail & Related papers (2024-12-12T18:59:53Z) - EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We establish a new frontier for lightweight models at the 5M-parameter magnitude across various downstream tasks.
Our work rethinks the lightweight infrastructure of the efficient inverted residual block (IRB) and the practical components of the Transformer.
Considering the near-imperceptible latency of downloading models over 4G/5G bandwidth for mobile users, we investigate the performance upper limit of lightweight models at the 5M-parameter magnitude.
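For readers unfamiliar with the inverted residual block (IRB) mentioned in the EMOv2 summary above, the sketch below shows its classic MobileNetV2-style form (expand, depthwise convolution, project, residual). It only illustrates the general structure; EMOv2's own block design differs in detail.

```python
# Classic inverted residual block: 1x1 expand, 3x3 depthwise conv, 1x1 project,
# with a residual connection. Illustrative only; not EMOv2's exact block.
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    def __init__(self, dim: int, expand: int = 4):
        super().__init__()
        hidden = dim * expand
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),     # expand
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                  # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),     # project
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection (same shape in and out)


if __name__ == "__main__":
    x = torch.randn(1, 32, 56, 56)
    print(InvertedResidual(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```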
arXiv Detail & Related papers (2024-12-09T17:12:22Z) - MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation [21.242398582282522]
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation.
MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models.
Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy.
Our larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters than the state-of-the-art model.
arXiv Detail & Related papers (2024-10-03T01:23:44Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach consistently brings substantial gains to GPT-4V/O across four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z) - SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
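The SimDA summary above describes tuning only a small fraction of a large pretrained model. The sketch below illustrates the general residual-adapter pattern that such parameter-efficient adaptation relies on: freeze the pretrained layer and train only a small bottleneck module. The Adapter class and sizes are illustrative assumptions, not SimDA's actual adapter design.

```python
# Parameter-efficient adaptation sketch: freeze a pretrained layer, train only a
# small residual bottleneck adapter. Sizes and classes are assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual adapter


frozen_block = nn.Linear(512, 512)           # stand-in for a pretrained T2I layer
for p in frozen_block.parameters():
    p.requires_grad_(False)                  # base model stays frozen

adapter = Adapter(512)                       # only these weights are trained
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")
```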
arXiv Detail & Related papers (2023-08-18T17:58:44Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize mode approximation to generate 0.1M trainable parameters for multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
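The Aurora summary above mentions multimodal prompt tuning with roughly 0.1M trainable parameters. The sketch below shows the generic prompt-tuning pattern under stated assumptions: a frozen backbone plus a handful of learnable prompt vectors prepended to the input. Aurora's mode-approximation scheme itself is not reproduced here, and the plain Transformer encoder is only a stand-in backbone.

```python
# Generic prompt tuning sketch: freeze the backbone, learn only a few prompt
# vectors prepended to the input sequence. All sizes are assumptions.
import torch
import torch.nn as nn

dim, n_prompts = 256, 8
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad_(False)                          # backbone stays frozen

prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)  # trainable prompts

tokens = torch.randn(2, 20, dim)                     # stand-in input features
x = torch.cat([prompts.expand(2, -1, -1), tokens], dim=1)
out = backbone(x)
print(out.shape, "trainable params:", prompts.numel())
```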
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models.
AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models.
When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
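The AdaMTL summary above refers to learned, task-aware inference policies. As a rough illustration only, the sketch below shows one common form of input-dependent inference: a tiny policy network gates which encoder blocks run for a given input. The hard threshold and toy blocks are assumptions, not AdaMTL's actual policy or training scheme.

```python
# Input-dependent inference sketch: a small policy network produces per-block
# gates; blocks with a gate below 0.5 are skipped. Illustrative only.
import torch
import torch.nn as nn


class AdaptiveEncoder(nn.Module):
    def __init__(self, dim: int = 64, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_blocks)]
        )
        self.policy = nn.Linear(dim, n_blocks)  # one gate logit per block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.policy(x.mean(dim=1)))       # (batch, n_blocks)
        for i, block in enumerate(self.blocks):
            keep = (gates[:, i] > 0.5).float().view(-1, 1, 1)   # hard gate
            # A real implementation would avoid computing skipped blocks entirely.
            x = keep * block(x) + (1.0 - keep) * x
        return x


if __name__ == "__main__":
    x = torch.randn(2, 10, 64)
    print(AdaptiveEncoder()(x).shape)  # torch.Size([2, 10, 64])
```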
arXiv Detail & Related papers (2023-04-17T20:17:44Z) - LiteMuL: A Lightweight On-Device Sequence Tagger using Multi-task Learning [1.3192560874022086]
LiteMuL is a lightweight on-device sequence tagger that efficiently processes user conversations using a multi-task learning approach.
Our model is competitive with other MTL approaches on NER and POS tasks while outshining them with a lower memory footprint.
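The LiteMuL summary above describes a multi-task sequence tagger for NER and POS. The sketch below shows the standard shared-encoder, two-head layout such a model typically uses; the layer choices and tag counts are illustrative assumptions, not LiteMuL's architecture.

```python
# Multi-task sequence tagging sketch: one shared lightweight encoder feeds
# separate NER and POS heads. Sizes and tag counts are assumptions.
import torch
import torch.nn as nn


class MultiTaskTagger(nn.Module):
    def __init__(self, vocab: int = 10000, dim: int = 128,
                 n_ner_tags: int = 9, n_pos_tags: int = 17):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # shared encoder
        self.ner_head = nn.Linear(dim, n_ner_tags)           # task-specific heads
        self.pos_head = nn.Linear(dim, n_pos_tags)

    def forward(self, tokens: torch.Tensor):
        h, _ = self.encoder(self.embed(tokens))
        return self.ner_head(h), self.pos_head(h)


if __name__ == "__main__":
    tokens = torch.randint(0, 10000, (2, 12))    # (batch, sequence length)
    ner_logits, pos_logits = MultiTaskTagger()(tokens)
    print(ner_logits.shape, pos_logits.shape)
```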
arXiv Detail & Related papers (2020-12-15T19:15:54Z)