Towards Multi-Task Multi-Modal Models: A Video Generative Perspective
- URL: http://arxiv.org/abs/2405.16728v1
- Date: Sun, 26 May 2024 23:56:45 GMT
- Title: Towards Multi-Task Multi-Modal Models: A Video Generative Perspective
- Authors: Lijun Yu
- Abstract summary: This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
- Score: 5.495245220300184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions, as well as for understanding and compression applications. Given the high dimensionality of visual data, we pursue concise and accurate latent representations. Our video-native spatial-temporal tokenizers preserve high fidelity. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Furthermore, our scalable visual token representation proves beneficial across generation, compression, and understanding tasks. This achievement marks the first instances of language models surpassing diffusion models in visual synthesis and a video tokenizer outperforming industry-standard codecs. Within these multi-modal latent spaces, we study the design of multi-task generative models. Our masked multi-task transformer excels at the quality, efficiency, and flexibility of video generation. We enable a frozen language model, trained solely on text, to generate visual content. Finally, we build a scalable generative multi-modal transformer trained from scratch, enabling the generation of videos containing high-fidelity motion with the corresponding audio given diverse conditions. Throughout the course, we have shown the effectiveness of integrating multiple tasks, crafting high-fidelity latent representation, and generating multiple modalities. This work suggests intriguing potential for future exploration in generating non-textual data and enabling real-time, interactive experiences across various media forms.
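For intuition, here is a minimal sketch of what a video-native spatial-temporal tokenizer can look like: 3D convolutions downsample a clip in time and space, and each latent vector is snapped to its nearest codebook entry to yield discrete token ids. All module names, shapes, and sizes are illustrative assumptions, not the thesis's actual architecture.

```python
# Hypothetical sketch of a spatial-temporal video tokenizer: 3D convolutions
# downsample a clip in time and space, and a learned codebook quantizes each
# latent vector to a discrete token id. Shapes and names are illustrative only.
import torch
import torch.nn as nn

class VideoTokenizerSketch(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        # Encoder: (B, 3, T, H, W) -> (B, dim, T/4, H/8, W/8)
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(128, dim, kernel_size=3, stride=(2, 2, 2), padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, video):
        z = self.encoder(video)              # (B, dim, t, h, w)
        flat = z.flatten(2).transpose(1, 2)  # (B, t*h*w, dim)
        # Nearest-codebook-entry lookup (vector quantization).
        dists = torch.cdist(flat, self.codebook.weight[None])
        return dists.argmin(-1)              # discrete token ids

tokens = VideoTokenizerSketch()(torch.randn(1, 3, 16, 64, 64))
print(tokens.shape)  # (1, 256): 4*8*8 token ids for the clip
```

An autoregressive language model can then be trained directly over such token ids, which is the setting in which the abstract's language-model-versus-diffusion comparison takes place.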
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
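The retrieval pairing can be pictured as nearest-neighbor search in a text-embedding space; the stand-in below attaches the closest captioned examples to a prompt. The `embed` function is a random placeholder, not VIMI's actual encoder or dataset.

```python
# Illustrative stand-in for retrieval-based prompt pairing: embed the text
# prompt and a pool of captioned examples, then attach the nearest neighbors
# as in-context examples. The embedding function is a placeholder.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real pipeline would use a learned text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

pool = ["a dog surfing a wave", "timelapse of a city at night",
        "a chef flipping a pancake"]
prompt = "a golden retriever riding a surfboard"

sims = [float(embed(prompt) @ embed(c)) for c in pool]
paired = [pool[i] for i in np.argsort(sims)[::-1][:2]]  # top-2 neighbors
print({"prompt": prompt, "in_context_examples": paired})
```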
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
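The decomposition can be pictured as splitting a clip into appearance (a keyframe) and motion (per-frame residuals); the toy version below uses frame differencing as a crude, invertible stand-in and is not Video-LaVIT's actual tokenizer.

```python
# Rough sketch of decomposing a clip into a keyframe plus per-frame motion,
# in the spirit of decoupling visual content from motion. Real systems use
# codec motion vectors; frame residuals here are a crude stand-in.
import numpy as np

def decompose(clip: np.ndarray):
    """clip: (T, H, W, 3) uint8 frames."""
    keyframe = clip[0]
    # Coarse "motion" signal: signed residual of each frame vs. the keyframe.
    motion = clip[1:].astype(np.int16) - keyframe.astype(np.int16)
    return keyframe, motion

def reconstruct(keyframe: np.ndarray, motion: np.ndarray) -> np.ndarray:
    rest = (keyframe.astype(np.int16) + motion).clip(0, 255).astype(np.uint8)
    return np.concatenate([keyframe[None], rest])

clip = np.random.randint(0, 256, (8, 32, 32, 3), dtype=np.uint8)
key, mot = decompose(clip)
assert np.array_equal(reconstruct(key, mot), clip)  # lossless round-trip
```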
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z) - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
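One common way to make a single autoregressive model span modalities is to give each modality its own id range within one shared vocabulary, so everything becomes a flat token sequence; the offsets below are invented for illustration and are not Unified-IO 2's.

```python
# Illustrative only: each modality gets its own id range inside one shared
# vocabulary, so a single transformer can model the flat sequence.
TEXT, IMAGE, AUDIO, ACTION = range(4)
BASE = {TEXT: 0, IMAGE: 50_000, AUDIO: 60_000, ACTION: 70_000}

def to_shared(modality: int, local_ids: list) -> list:
    return [BASE[modality] + i for i in local_ids]

# A single flat sequence the transformer can model left-to-right:
seq = to_shared(TEXT, [17, 942]) + to_shared(IMAGE, [5, 1023]) \
      + to_shared(AUDIO, [88])
print(seq)  # [17, 942, 50005, 51023, 60088]
```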
arXiv Detail & Related papers (2023-12-28T17:57:06Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
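The data-generation idea can be sketched as prompting a text-only language model with captions and asking it to write an image-grounded conversation; `lm_generate` below is a placeholder, not TextBind's actual pipeline or any real API.

```python
# Hedged sketch of the idea: starting from image-caption pairs only, prompt
# a text-only LM to write a multi-turn conversation that refers to the
# images. `lm_generate` is a stand-in for a real language-model call.
def lm_generate(prompt: str) -> str:
    # Placeholder response; a real system would call an actual LM here.
    return ("USER: What is happening in <img0>?\n"
            "ASSISTANT: A dog is surfing a large wave.")

def build_conversation(captions: list) -> str:
    context = "\n".join(
        f"<img{i}>: {cap}" for i, cap in enumerate(captions))
    prompt = (
        "Here are images described by captions:\n" + context +
        "\nWrite a multi-turn user/assistant conversation about them, "
        "referring to images as <imgN>.")
    return lm_generate(prompt)

print(build_conversation(["a dog surfing a wave"]))
```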
arXiv Detail & Related papers (2023-09-14T15:34:01Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
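A toy version of joint learning with maximal weight sharing: one trunk serves both a contrastive image-text objective and a generative next-token objective in a single update. This is a generic sketch, not MaMMUT's actual architecture.

```python
# One shared trunk, two objectives, one joint update (illustrative only).
import torch
import torch.nn.functional as F

trunk = torch.nn.Linear(64, 64)   # stand-in for the shared decoder trunk
head = torch.nn.Linear(64, 100)   # stand-in next-token prediction head
img, txt = torch.randn(8, 64), torch.randn(8, 64)

h_img, h_txt = trunk(img), trunk(txt)     # same weights, both modalities
# Contrastive objective: match each image with its paired text.
sim = F.normalize(h_img, dim=-1) @ F.normalize(h_txt, dim=-1).T
contrastive = F.cross_entropy(sim / 0.07, torch.arange(8))
# Generative objective: predict (dummy) next tokens from the same trunk.
generative = F.cross_entropy(head(h_txt), torch.randint(0, 100, (8,)))

(contrastive + generative).backward()     # one joint update, shared weights
```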
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
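The core mechanism can be sketched as projecting a continuous sensor reading into the language model's token-embedding space and splicing it into the text sequence; the dimensions and modules below are illustrative assumptions, not PaLM-E's.

```python
# Hedged sketch: project continuous sensor readings into the LM's token-
# embedding space and insert them as "soft tokens" in the text sequence.
import torch
import torch.nn as nn

d_model = 512
project = nn.Linear(12, d_model)          # 12-D robot state -> soft token
embed = nn.Embedding(32_000, d_model)     # stand-in LM token embeddings

text_ids = torch.tensor([[101, 2009, 2003]])  # placeholder token ids
sensor = torch.randn(1, 1, 12)                # one continuous reading

inputs = torch.cat([embed(text_ids), project(sensor)], dim=1)
print(inputs.shape)  # (1, 4, 512): 3 text tokens + 1 sensor token
```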
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation uses both video frames and transcripts to generate a natural-language title for a video.
Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks.
We propose a novel approach to graft the video encoder from the pre-trained video-language model onto the generative pre-trained language model.
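Grafting can be pictured as freezing both pretrained components and training only a small bridge between them; all modules below are random stand-ins, not the paper's actual encoder, language model, or adapter.

```python
# Illustrative "grafting": keep a pretrained video encoder and a pretrained
# generative LM frozen, and train only a small bridge between them.
import torch
import torch.nn as nn

video_encoder = nn.Linear(2048, 768)   # stand-in pretrained video encoder
language_model = nn.Linear(768, 768)   # stand-in pretrained generative LM
bridge = nn.Linear(768, 768)           # the only newly trained module

for module in (video_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False        # freeze both pretrained parts

video_feats = torch.randn(1, 8, 2048)  # 8 frame features
out = language_model(bridge(video_encoder(video_feats)))
print(out.shape)  # (1, 8, 768)
```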
arXiv Detail & Related papers (2022-11-14T08:59:59Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
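Self-supervision across the three streams is commonly realized with pairwise contrastive (InfoNCE-style) losses over clip-aligned embeddings; the sketch below is a generic stand-in, not the paper's exact loss or network design.

```python
# Generic pairwise contrastive alignment across visual, audio, and text
# embeddings from the same clips (illustrative stand-in only).
import torch
import torch.nn.functional as F

def nce(a, b, temp=0.07):
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / temp
    return F.cross_entropy(logits, torch.arange(len(a)))

vis, aud, txt = (torch.randn(16, 128, requires_grad=True) for _ in range(3))
loss = nce(vis, aud) + nce(vis, txt)   # e.g., align audio and text to vision
loss.backward()
```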
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.