MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- URL: http://arxiv.org/abs/2311.17435v1
- Date: Wed, 29 Nov 2023 08:27:00 GMT
- Title: MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- Authors: Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li,
Chung-Ching Lin, Zicheng Liu, Lijuan Wang
- Abstract summary: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
- Score: 120.95150400119705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal
in-context learning for the generation of audio descriptions (AD). Unlike
previous methods that primarily focused on downstream fine-tuning with short
video clips, MM-Narrator excels in generating precise audio descriptions for
videos of extensive lengths, even beyond hours, in an autoregressive manner.
This capability is made possible by the proposed memory-augmented generation
process, which effectively utilizes both the short-term textual context and
long-term visual memory through an efficient register-and-recall mechanism.
These contextual memories compile pertinent past information, including
storylines and character identities, ensuring accurate tracking and
depiction of story-coherent and character-centric audio descriptions.
Maintaining the training-free design of MM-Narrator, we further propose a
complexity-based demonstration selection strategy to substantially enhance its
multi-step reasoning capability via few-shot multimodal in-context learning
(MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator
consistently outperforms both existing fine-tuning-based and LLM-based
approaches in most scenarios, as measured by standard evaluation
metrics. Additionally, we introduce the first segment-based evaluator for
recurrent text generation. Empowered by GPT-4, this evaluator comprehensively
reasons about and scores AD generation performance along various extendable dimensions.
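The abstract describes the register-and-recall memory mechanism only at a high level. Below is a minimal sketch of how such a memory-augmented, autoregressive AD loop could be structured; all names here (MemoryBank, embed_clip, call_gpt4, narrate) are hypothetical stand-ins and not taken from the paper.

# Minimal sketch of a memory-augmented, register-and-recall AD generation loop.
# NOT the paper's implementation: MemoryBank, embed_clip(), and call_gpt4()
# are hypothetical stand-ins used only to illustrate the idea from the abstract.
from collections import deque
from dataclasses import dataclass, field

import numpy as np


def embed_clip(clip) -> np.ndarray:
    """Hypothetical visual encoder; returns a feature vector for a video clip."""
    return np.random.rand(512)  # placeholder embedding


def call_gpt4(prompt: str) -> str:
    """Hypothetical LLM call; in practice this would query GPT-4."""
    return f"[AD generated from a prompt of {len(prompt)} characters]"


@dataclass
class MemoryBank:
    """Long-term visual memory with a register-and-recall interface."""
    keys: list = field(default_factory=list)    # clip embeddings
    values: list = field(default_factory=list)  # past ADs / character notes

    def register(self, clip, description: str) -> None:
        self.keys.append(embed_clip(clip))
        self.values.append(description)

    def recall(self, clip, top_k: int = 3) -> list:
        if not self.keys:
            return []
        query = embed_clip(clip)
        keys = np.stack(self.keys)
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
        best = np.argsort(-sims)[:top_k]
        return [self.values[i] for i in best]


def narrate(clips) -> list:
    """Autoregressively narrate a long video, clip by clip."""
    short_term = deque(maxlen=5)   # short-term textual context (recent ADs)
    long_term = MemoryBank()       # long-term visual memory
    ads = []
    for clip in clips:
        recalled = long_term.recall(clip)          # recall storylines / characters
        prompt = (
            "Recalled context:\n" + "\n".join(recalled) +
            "\nRecent ADs:\n" + "\n".join(short_term) +
            "\nDescribe the current clip as an audio description:"
        )
        ad = call_gpt4(prompt)
        ads.append(ad)
        short_term.append(ad)
        long_term.register(clip, ad)               # register for future recall
    return ads

Because each step only conditions on a bounded short-term window plus a small number of recalled entries, the prompt stays a fixed size regardless of video length, which is what makes hour-scale autoregressive narration feasible in this kind of design.

The abstract likewise does not spell out the complexity-based demonstration selection strategy; the sketch below uses the number of reasoning sentences as an assumed proxy for demonstration complexity, which is an illustrative choice rather than the paper's actual measure.

# Minimal sketch of complexity-based demonstration selection for few-shot MM-ICL.
# The complexity measure here (reasoning sentence count) is an assumption.
from dataclasses import dataclass


@dataclass
class Demonstration:
    context: str    # multimodal context rendered as text (e.g., recalled memory)
    reasoning: str  # intermediate reasoning trace
    ad: str         # reference audio description


def complexity(demo: Demonstration) -> int:
    """Assumed proxy: more reasoning sentences == a more complex demonstration."""
    return demo.reasoning.count(".")


def select_demonstrations(pool: list, k: int = 4) -> list:
    """Pick the k most complex demonstrations to encourage multi-step reasoning."""
    return sorted(pool, key=complexity, reverse=True)[:k]


def build_icl_prompt(demos: list, query_context: str) -> str:
    """Assemble a few-shot prompt from the selected demonstrations."""
    shots = "\n\n".join(
        f"Context: {d.context}\nReasoning: {d.reasoning}\nAD: {d.ad}" for d in demos
    )
    return f"{shots}\n\nContext: {query_context}\nReasoning:"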
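Selecting more complex demonstrations keeps the approach training-free: the model's multi-step reasoning is steered purely through the in-context examples rather than through any fine-tuning.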
Related papers
- Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module into state-of-the-art LVLMs.
CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL).
We design a multi-stage learning strategy to optimize the model, including pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z) - SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation [92.73405185996315]
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation.
Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering.
We introduce a model-agnostic iterative self-feedback framework (SILMM) that enables LMMs to provide helpful and scalable self-improvement and to optimize text-image alignment.
arXiv Detail & Related papers (2024-12-08T05:28:08Z) - Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches.
We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), that employs a multimodal AI system to oversee itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)