MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- URL: http://arxiv.org/abs/2311.17435v1
- Date: Wed, 29 Nov 2023 08:27:00 GMT
- Title: MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning
- Authors: Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li,
Chung-Ching Lin, Zicheng Liu, Lijuan Wang
- Abstract summary: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
- Score: 120.95150400119705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal
in-context learning for the generation of audio descriptions (AD). Unlike
previous methods that primarily focused on downstream fine-tuning with short
video clips, MM-Narrator excels in generating precise audio descriptions for
videos of extensive lengths, even beyond hours, in an autoregressive manner.
This capability is made possible by the proposed memory-augmented generation
process, which effectively utilizes both the short-term textual context and
long-term visual memory through an efficient register-and-recall mechanism.
These contextual memories compile pertinent past information, including
storylines and character identities, ensuring accurate tracking and depiction
of story-coherent, character-centric audio descriptions.
Maintaining the training-free design of MM-Narrator, we further propose a
complexity-based demonstration selection strategy to substantially enhance its
multi-step reasoning capability via few-shot multimodal in-context learning
(MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator
consistently outperforms both the existing fine-tuning-based approaches and
LLM-based approaches in most scenarios, as measured by standard evaluation
metrics. Additionally, we introduce the first segment-based evaluator for
recurrent text generation. Empowered by GPT-4, this evaluator comprehensively
reasons and marks AD generation performance in various extendable dimensions.
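The memory-augmented generation process described above can be sketched in code. The following is a minimal, illustrative Python sketch of a register-and-recall loop combining short-term textual context with long-term visual memory; all class names, data structures, and the keyword-overlap recall heuristic are assumptions for illustration, not the authors' actual implementation (which uses GPT-4 and learned retrieval over real video features).

```python
from collections import deque

# Hypothetical sketch of a register-and-recall memory for autoregressive
# AD generation. All names and heuristics here are illustrative
# assumptions, not MM-Narrator's actual implementation.

class NarratorMemory:
    """Short-term textual context plus long-term memory of past facts."""

    def __init__(self, short_term_size=5):
        # Short-term memory: the last few generated AD sentences.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term memory: (key, entry) pairs registered over the video,
        # e.g. character identities and storyline facts.
        self.long_term = []

    def register(self, key, entry):
        """Store a pertinent fact (e.g. 'character: Alice wears red')."""
        self.long_term.append((key, entry))

    def recall(self, query_terms, top_k=3):
        """Recall long-term entries overlapping the current query terms.

        A real system would use embedding similarity; simple keyword
        overlap stands in for it here.
        """
        scored = [
            (sum(t in key or t in entry for t in query_terms), entry)
            for key, entry in self.long_term
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [entry for score, entry in scored[:top_k] if score > 0]


def narrate_segment(memory, clip_caption, llm=None):
    """Generate one AD sentence for a clip, conditioned on both memories."""
    recalled = memory.recall(clip_caption.split())
    prompt = (
        "Recent ADs: " + " ".join(memory.short_term) + "\n"
        "Recalled facts: " + "; ".join(recalled) + "\n"
        "Current clip: " + clip_caption + "\nAD:"
    )
    # A real system would call GPT-4 on `prompt`; stubbed for illustration.
    ad = llm(prompt) if llm else f"[AD for: {clip_caption}]"
    memory.short_term.append(ad)  # register into short-term context
    return ad
```

In this sketch, each generated AD is appended to the bounded short-term deque (so older sentences fall out automatically), while durable facts are explicitly registered into long-term memory and recalled on demand, mirroring the register-and-recall mechanism described in the abstract.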
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application.
Recently, LLMs have been extended into multi-modal large language models (MM-LLMs).
This paper provides an extensive review of the current state of LLMs with multi-modal capabilities, as well as the very recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that large-scale multimodal pre-training benefits from a careful mix of image-caption, interleaved image-text, and text-only data.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [30.284100018891397]
Multi-Modal In-Context Tuning (MMICT) is a novel fine-tuning paradigm that boosts multi-modal fine-tuning with in-context examples.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLM (LLM & VLM) inference: explicit, controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
- See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized, multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of it) and is not responsible for any consequences of its use.