CLearViD: Curriculum Learning for Video Description
- URL: http://arxiv.org/abs/2311.04480v1
- Date: Wed, 8 Nov 2023 06:20:32 GMT
- Title: CLearViD: Curriculum Learning for Video Description
- Authors: Cheng-Yu Chuang, Pooyan Fazli
- Abstract summary: Video description entails automatically generating coherent natural language sentences that narrate the content of a given video.
We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task.
The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
- Score: 3.5293199207536627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video description entails automatically generating coherent natural language
sentences that narrate the content of a given video. We introduce CLearViD, a
transformer-based model for video description generation that leverages
curriculum learning to accomplish this task. In particular, we investigate two
curriculum strategies: (1) progressively exposing the model to more challenging
samples by gradually applying a Gaussian noise to the video data, and (2)
gradually reducing the capacity of the network through dropout during the
training process. These methods enable the model to learn more robust and
generalizable features. Moreover, CLearViD leverages the Mish activation
function, which provides non-linearity and non-monotonicity and helps alleviate
the issue of vanishing gradients. Our extensive experiments and ablation
studies demonstrate the effectiveness of the proposed model. The results on two
datasets, namely ActivityNet Captions and YouCook2, show that CLearViD
significantly outperforms existing state-of-the-art models in terms of both
accuracy and diversity metrics.
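As a rough illustration of the two curriculum strategies and the Mish activation (Mish(x) = x * tanh(softplus(x))) mentioned in the abstract, the sketch below linearly ramps a Gaussian-noise level and a dropout rate over training epochs and applies Mish inside a toy captioning head. The schedules, hyperparameters (sigma_max, p_max), and the toy model are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (assumptions, not CLearViD's code): Gaussian-noise and dropout
# curricula plus the Mish activation, applied to a toy captioning head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)): smooth, non-monotonic activation.
    return x * torch.tanh(F.softplus(x))

def noise_sigma(epoch: int, num_epochs: int, sigma_max: float = 0.1) -> float:
    # Curriculum 1 (assumed linear ramp): noise grows, so samples get harder.
    return sigma_max * epoch / max(1, num_epochs - 1)

def dropout_p(epoch: int, num_epochs: int, p_max: float = 0.3) -> float:
    # Curriculum 2 (assumed linear ramp): dropout grows, shrinking capacity.
    return p_max * epoch / max(1, num_epochs - 1)

class ToyCaptionHead(nn.Module):
    # Stand-in for the transformer decoder; only shows where Mish/dropout apply.
    def __init__(self, feat_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats: torch.Tensor, p: float) -> torch.Tensor:
        h = F.dropout(mish(feats), p=p, training=self.training)
        return self.proj(h)  # per-frame vocabulary logits

if __name__ == "__main__":
    num_epochs = 10
    model = ToyCaptionHead(feat_dim=512, vocab_size=1000)
    feats = torch.randn(4, 20, 512)  # (batch, frames, feature dim)
    for epoch in range(num_epochs):
        sigma = noise_sigma(epoch, num_epochs)
        p = dropout_p(epoch, num_epochs)
        noisy = feats + sigma * torch.randn_like(feats)  # harder inputs over time
        logits = model(noisy, p=p)
        print(f"epoch {epoch}: sigma={sigma:.3f}, p={p:.3f}, logits {tuple(logits.shape)}")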
Related papers
- VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a novel self-training pipeline for vision-language models (VLMs).
It generates its own training data without extensive manual annotation.
VideoSAVi shows significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2024-12-01T00:33:05Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Curriculum-Guided Abstractive Summarization [45.57561926145256]
Recent Transformer-based summarization models have provided a promising approach to abstractive summarization.
These models have two shortcomings: (1) they often perform poorly in content selection, and (2) their training strategy is inefficient, which restricts model performance.
In this paper, we explore two ways to compensate for these pitfalls. First, we augment the Transformer network with a sentence cross-attention module in the decoder, encouraging more abstraction of salient content.
arXiv Detail & Related papers (2023-02-02T11:09:37Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-Supervised Representation Learning for Detection of ACL Tear Injury in Knee MR Videos [18.54362818156725]
We propose a self-supervised learning approach to learn transferable features from MR video clips by encouraging the model to learn anatomical features.
To the best of our knowledge, none of the supervised learning models that perform injury classification from MR videos provides any explanation for its decisions.
arXiv Detail & Related papers (2020-07-15T15:35:47Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.