MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- URL: http://arxiv.org/abs/2311.17435v1
- Date: Wed, 29 Nov 2023 08:27:00 GMT
- Title: MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- Authors: Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li,
Chung-Ching Lin, Zicheng Liu, Lijuan Wang
- Abstract summary: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
- Score: 120.95150400119705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal
in-context learning for the generation of audio descriptions (AD). Unlike
previous methods that primarily focused on downstream fine-tuning with short
video clips, MM-Narrator excels in generating precise audio descriptions for
videos of extensive lengths, even beyond hours, in an autoregressive manner.
This capability is made possible by the proposed memory-augmented generation
process, which effectively utilizes both the short-term textual context and
long-term visual memory through an efficient register-and-recall mechanism.
These contextual memories compile pertinent past information, including
storylines and character identities, ensuring accurate tracking and depiction
of story-coherent, character-centric audio descriptions. While preserving the
training-free design of MM-Narrator, we further propose a complexity-based
demonstration selection strategy to substantially enhance its multi-step
reasoning capability via few-shot multimodal in-context learning (MM-ICL).
Experimental results on the MAD-eval dataset demonstrate that MM-Narrator
consistently outperforms both existing fine-tuning-based approaches and
LLM-based approaches in most scenarios, as measured by standard evaluation
metrics. Additionally, we introduce the first segment-based evaluator for
recurrent text generation. Empowered by GPT-4, this evaluator comprehensively
reasons about and scores AD generation performance across various extendable
dimensions.
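The abstract describes the memory-augmented, register-and-recall generation
process only at a high level. The following is a minimal Python sketch of how
such an autoregressive loop over short-term textual context and long-term
visual memory could be structured; all names and helpers (`LongTermVisualMemory`,
`embed_fn`, `caption_fn`, `llm_fn`) are illustrative assumptions, not the
authors' implementation.

```python
# Hypothetical sketch of a register-and-recall AD generation loop.
# Class, function, and parameter names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LongTermVisualMemory:
    """Stores per-segment visual embeddings plus their generated ADs,
    and recalls the entries most similar to the current segment."""
    entries: List[dict] = field(default_factory=list)

    def register(self, segment_id: int, embedding: List[float], caption: str) -> None:
        self.entries.append(
            {"id": segment_id, "embedding": embedding, "caption": caption}
        )

    def recall(self, query: List[float], top_k: int = 5) -> List[dict]:
        # Rank registered entries by cosine similarity to the current segment.
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / (na * nb + 1e-8)

        ranked = sorted(self.entries, key=lambda e: cosine(e["embedding"], query),
                        reverse=True)
        return ranked[:top_k]


def narrate_video(segments: List[object],
                  embed_fn: Callable[[object], List[float]],
                  caption_fn: Callable[[object], str],
                  llm_fn: Callable[[dict], str],
                  context_window: int = 8) -> List[str]:
    """Autoregressively generate one AD per segment, conditioning each step on
    short-term textual context (recent ADs) and recalled long-term memory."""
    memory = LongTermVisualMemory()
    short_term: List[str] = []          # recent AD sentences
    audio_descriptions: List[str] = []

    for i, segment in enumerate(segments):
        visual_emb = embed_fn(segment)         # e.g., CLIP-style features
        recalled = memory.recall(visual_emb)   # long-term visual memory
        prompt = {
            "recent_ads": short_term[-context_window:],
            "recalled_context": [e["caption"] for e in recalled],
            "current_segment_caption": caption_fn(segment),
        }
        ad = llm_fn(prompt)                    # GPT-4 call in MM-Narrator

        # Register the new segment so later steps can recall it.
        memory.register(i, visual_emb, ad)
        short_term.append(ad)
        audio_descriptions.append(ad)

    return audio_descriptions
```

In MM-Narrator itself, the complexity-based demonstration selection would
plausibly slot in just before the LLM call, choosing few-shot MM-ICL exemplars
whose reasoning complexity matches the current segment; this sketch omits
that step.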
Related papers
- Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches.
We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization [33.67670065326008]
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs).
Our results show that a fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness, and conciseness.
arXiv Detail & Related papers (2024-10-24T10:32:10Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the insufficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)