MART: Memory-Augmented Recurrent Transformer for Coherent Video
Paragraph Captioning
- URL: http://arxiv.org/abs/2005.05402v1
- Date: Mon, 11 May 2020 20:01:41 GMT
- Title: MART: Memory-Augmented Recurrent Transformer for Coherent Video
Paragraph Captioning
- Authors: Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit
Bansal
- Abstract summary: We propose a new approach called Memory-Augmented Recurrent Transformer (MART)
MART uses a memory module to augment the transformer architecture.
MART generates more coherent and less repetitive paragraph captions than baseline methods.
- Score: 128.36951818335046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating multi-sentence descriptions for videos is one of the most
challenging captioning tasks due to its high requirements for not only visual
relevance but also discourse-based coherence across the sentences in the
paragraph. Towards this goal, we propose a new approach called Memory-Augmented
Recurrent Transformer (MART), which uses a memory module to augment the
transformer architecture. The memory module generates a highly summarized
memory state from the video segments and the sentence history to help predict
the next sentence more accurately (with respect to coreference and repetition),
thereby encouraging coherent paragraph generation. Extensive
experiments, human evaluations, and qualitative analyses on two popular
datasets ActivityNet Captions and YouCookII show that MART generates more
coherent and less repetitive paragraph captions than baseline methods, while
maintaining relevance to the input video events. All code is available
open-source at: https://github.com/jayleicn/recurrent-transformer
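To make the described architecture concrete, below is a minimal sketch (in PyTorch) of one recurrent step of a memory-augmented transformer, in the spirit of the abstract: a fixed-size memory state summarizes the current video segment and sentence history and is carried over to condition the next sentence. The module names, the mean-pooled summary, and the GRU-style gated update are illustrative assumptions rather than the authors' exact design; the official implementation is in the repository linked above.

```python
# Illustrative sketch of a memory-augmented recurrent transformer step.
# The gated update rule, pooling, and layer sizes are assumptions for
# exposition, not the MART authors' code.
import torch
import torch.nn as nn


class MemoryAugmentedStep(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, mem_slots: int = 1):
        super().__init__()
        # Transformer encoder layer over [memory; segment; sentence] tokens.
        self.encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Gated update blending the old memory with a summary of the new inputs.
        self.update_gate = nn.Linear(2 * d_model, d_model)
        self.candidate = nn.Linear(2 * d_model, d_model)
        self.mem_slots = mem_slots

    def forward(self, memory, segment_feats, sent_embs):
        # memory:        (B, mem_slots, d)  summarized history from earlier segments
        # segment_feats: (B, Tv, d)         visual features of the current segment
        # sent_embs:     (B, Tw, d)         embeddings of the current sentence tokens
        x = torch.cat([memory, segment_feats, sent_embs], dim=1)
        hidden = self.encoder(x)

        # Summarize the non-memory positions (mean pool) as the new evidence.
        summary = hidden[:, self.mem_slots:].mean(dim=1, keepdim=True)
        summary = summary.expand(-1, self.mem_slots, -1)

        # GRU-style gated memory update: keep what is still useful, add what is new.
        gate = torch.sigmoid(self.update_gate(torch.cat([memory, summary], dim=-1)))
        cand = torch.tanh(self.candidate(torch.cat([memory, summary], dim=-1)))
        new_memory = gate * memory + (1.0 - gate) * cand

        # hidden conditions the caption decoder for this segment;
        # new_memory is carried over to the next segment (recurrence).
        return hidden, new_memory


if __name__ == "__main__":
    B, d = 2, 512
    step = MemoryAugmentedStep(d_model=d)
    memory = torch.zeros(B, 1, d)
    for _ in range(3):  # three consecutive video segments
        segment = torch.randn(B, 20, d)   # placeholder visual features
        sentence = torch.randn(B, 15, d)  # placeholder token embeddings
        hidden, memory = step(memory, segment, sentence)
    print(hidden.shape, memory.shape)
```

Carrying only a fixed-size memory across segments, rather than re-encoding the full sentence history, is the design choice the abstract emphasizes for keeping paragraphs coherent and less repetitive.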
Related papers
- Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning [42.0725330677271]
We propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module.
Experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
arXiv Detail & Related papers (2024-11-06T17:11:44Z)
- HMT: Hierarchical Memory Transformer for Long Context Language Processing [35.730941605490194]
Hierarchical Memory Transformer (HMT) is a novel framework that enables and improves models' long-context processing ability.
We show that HMT steadily improves the long-context processing ability of context-constrained and long-context models.
arXiv Detail & Related papers (2024-05-09T19:32:49Z)
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a natural language query.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find videos in a database that contain pertinent moments.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer that models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
- Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation [0.20305676256390928]
We propose an Implicit Memory Transformer that implicitly retains memory through a new left context method.
Experiments on the MuST-C dataset show that the Implicit Memory Transformer provides a substantial speedup on the encoder forward pass.
arXiv Detail & Related papers (2023-07-03T22:20:21Z)
- Video Referring Expression Comprehension via Transformer with Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame, and the aligned region features are employed to provide fruitful clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo), which enhances the recurrence memory by incrementally attending to right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
- Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval [155.32369959647437]
Cross-modal video-text retrieval is a challenging task in the field of vision and language.
Existing approaches for this task all focus on how to design the encoding model through a hard negative ranking loss.
We propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval.
arXiv Detail & Related papers (2021-03-29T15:15:09Z)
- Exploration of Visual Features and their weighted-additive fusion for Video Captioning [0.7388859384645263]
Video captioning is a popular task that challenges models to describe events in videos using natural language.
In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context.
arXiv Detail & Related papers (2021-01-14T07:21:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences.