MuLTI: Efficient Video-and-Language Understanding with Text-Guided
MultiWay-Sampler and Multiple Choice Modeling
- URL: http://arxiv.org/abs/2303.05707v2
- Date: Fri, 1 Mar 2024 02:32:38 GMT
- Title: MuLTI: Efficient Video-and-Language Understanding with Text-Guided
MultiWay-Sampler and Multiple Choice Modeling
- Authors: Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi
- Abstract summary: This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
- Score: 7.737755720567113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-and-language understanding has a variety of applications in the
industry, such as video question answering, text-video retrieval, and
multi-label classification. Existing video-and-language understanding methods
generally adopt heavy multi-modal encoders and feature fusion modules, which
consume high computational costs. In particular, they have difficulty dealing with
dense video frames or long text prevalent in industrial applications. This
paper proposes MuLTI, a highly accurate and efficient video-and-language
understanding model that achieves efficient and effective feature fusion and
rapid adaptation to downstream tasks. Specifically, we design a Text-Guided
MultiWay-Sampler based on adapt-pooling residual mapping and self-attention
modules to sample long sequences and fuse multi-modal features, which reduces
the computational costs and addresses performance degradation caused by
previous samplers. Therefore, MuLTI can handle longer sequences with limited
computational costs. Then, to further enhance the model's performance and
address the lack of pretraining tasks for video question answering, we propose a
new pretraining task named Multiple Choice Modeling. This task bridges the gap
between pretraining and downstream tasks and improves the model's ability to
align video and text features. Benefiting from the efficient feature fusion
module and the new pretraining task, MuLTI achieves state-of-the-art
performance on multiple datasets. Implementation and pretrained models will be
released.
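
To make the fusion idea concrete, below is a minimal PyTorch sketch of how a text-guided sampler built from adaptive pooling and self-attention could be wired. The class name TextGuidedSampler, the pooled-query design, and all dimensions are assumptions inferred from the abstract, not the released MuLTI implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedSampler(nn.Module):
    """Hypothetical sketch: compresses a long video token sequence to a fixed
    number of tokens while letting text tokens steer which content survives.
    Inferred from the abstract, not the released MuLTI code."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_sampled: int = 32):
        super().__init__()
        self.num_sampled = num_sampled
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def adapt_pool(self, x: torch.Tensor) -> torch.Tensor:
        # Adaptive pooling over the sequence axis: (B, L, D) -> (B, num_sampled, D).
        return F.adaptive_avg_pool1d(x.transpose(1, 2), self.num_sampled).transpose(1, 2)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from an adaptively pooled view of the video sequence,
        # so the output length is fixed no matter how many frames come in.
        queries = self.adapt_pool(video_tokens)
        # Keys/values mix video and text tokens, letting the text guide sampling.
        context = torch.cat([video_tokens, text_tokens], dim=1)
        fused, _ = self.attn(self.norm_q(queries), context, context)
        # The pooled sequence doubles as the residual path ("adapt-pooling
        # residual mapping"), since an identity residual cannot change length.
        fused = queries + fused
        return fused + self.ffn(self.norm_f(fused))


# Example: 1,024 video tokens and 64 text tokens fused into 32 tokens.
sampler = TextGuidedSampler()
out = sampler(torch.randn(2, 1024, 768), torch.randn(2, 64, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```

In this reading, the attention cost scales with the small number of sampled queries rather than with the full video length, which is consistent with the abstract's claim that longer sequences can be handled at limited computational cost.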
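The Multiple Choice Modeling pretraining task can likewise be pictured as asking the model to pick the caption that matches a video from a small set of candidates. The sketch below builds the candidates from in-batch negatives; the function multiple_choice_loss and its construction are hypothetical illustrations, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def multiple_choice_loss(video_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         num_choices: int = 4) -> torch.Tensor:
    """Hypothetical multiple-choice objective: each video must pick its paired
    caption out of (num_choices - 1) distractor captions drawn from the same
    batch. The exact construction in MuLTI may differ."""
    bsz = video_emb.size(0)
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarity of every video against every caption in the batch.
    sim = video_emb @ text_emb.t()  # (B, B)

    # For each video keep its own caption plus random in-batch distractors;
    # the offset trick guarantees a distractor is never the positive itself.
    offsets = torch.randint(1, bsz, (bsz, num_choices - 1))
    distractors = (torch.arange(bsz).unsqueeze(1) + offsets) % bsz
    choice_idx = torch.cat([torch.arange(bsz).unsqueeze(1), distractors], dim=1)

    logits = sim.gather(1, choice_idx)            # (B, num_choices), column 0 is correct
    targets = torch.zeros(bsz, dtype=torch.long)  # the true caption is always option 0
    return F.cross_entropy(logits, targets)


# Example with 768-d pooled video/text embeddings from the fusion module.
loss = multiple_choice_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Framed this way, the pretraining objective mirrors the answer-selection format of downstream video question answering, which is the gap-bridging role the abstract attributes to Multiple Choice Modeling.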
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025] (arXiv, 2024-10-14)
  We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
  Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442] (arXiv, 2024-07-08)
  Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
  We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
  We then finetune the first-stage model on three video generation tasks, incorporating multi-modal instructions.
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496] (arXiv, 2024-02-08)
  CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
  We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy.
  We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794] (arXiv, 2024-02-06)
  Video-and-text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
  We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself.
  Specifically, we propose context-aware reward modeling, providing detailed video descriptions as context during the generation of preference feedback.
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045] (arXiv, 2023-03-29)
  We propose a decoder-only model for multimodal tasks that is surprisingly effective at jointly learning these disparate vision-language tasks.
  We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight sharing of the model across the tasks.
  Our model achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection, outperforming much larger and more extensively trained foundation models.
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945] (arXiv, 2023-02-01)
  mPLUG-2 is a new unified paradigm with a modularized design for multi-modal pretraining.
  It shares common universal modules for modality collaboration while disentangling modality-specific modules to handle modality entanglement.
  Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639] (arXiv, 2022-12-19)
  We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models to long-form VideoQA.
  MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
  Visual concepts at different granularities are then processed efficiently through an attention module.
This list is automatically generated from the titles and abstracts of the papers on this site.