CLearViD: Curriculum Learning for Video Description
- URL: http://arxiv.org/abs/2311.04480v1
- Date: Wed, 8 Nov 2023 06:20:32 GMT
- Title: CLearViD: Curriculum Learning for Video Description
- Authors: Cheng-Yu Chuang, Pooyan Fazli
- Abstract summary: Video description entails automatically generating coherent natural language sentences that narrate the content of a given video.
We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task.
The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
- Score: 3.5293199207536627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video description entails automatically generating coherent natural language
sentences that narrate the content of a given video. We introduce CLearViD, a
transformer-based model for video description generation that leverages
curriculum learning to accomplish this task. In particular, we investigate two
curriculum strategies: (1) progressively exposing the model to more challenging
samples by gradually applying Gaussian noise to the video data, and (2)
gradually reducing the capacity of the network through dropout during the
training process. These methods enable the model to learn more robust and
generalizable features. Moreover, CLearViD leverages the Mish activation
function, which provides non-linearity and non-monotonicity and helps alleviate
the issue of vanishing gradients. Our extensive experiments and ablation
studies demonstrate the effectiveness of the proposed model. The results on two
datasets, namely ActivityNet Captions and YouCook2, show that CLearViD
significantly outperforms existing state-of-the-art models in terms of both
accuracy and diversity metrics.
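For concreteness, the snippet below is a minimal PyTorch sketch of the three ingredients named in the abstract: the Mish activation (x · tanh(softplus(x))), a noise curriculum that gradually increases the Gaussian noise applied to the video inputs, and a dropout curriculum that gradually reduces network capacity. The linear schedules, the hyperparameters (max_std, p_start, p_end), and the CurriculumBlock module are illustrative assumptions, not details taken from CLearViD.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish activation: x * tanh(softplus(x)); smooth and non-monotonic,
    # which helps alleviate vanishing gradients (as noted in the abstract).
    return x * torch.tanh(F.softplus(x))


def noise_std(epoch: int, total_epochs: int, max_std: float = 0.1) -> float:
    # Difficulty curriculum (assumed linear schedule): the standard deviation
    # of the Gaussian noise added to video features grows from 0 to max_std.
    return max_std * min(epoch / max(total_epochs - 1, 1), 1.0)


def dropout_p(epoch: int, total_epochs: int,
              p_start: float = 0.1, p_end: float = 0.5) -> float:
    # Capacity curriculum (assumed linear schedule): the dropout rate rises
    # over training, gradually shrinking the network's effective capacity.
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return p_start + (p_end - p_start) * frac


class CurriculumBlock(nn.Module):
    # Hypothetical encoder block showing where the two schedules would plug in.
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats: torch.Tensor,
                epoch: int, total_epochs: int) -> torch.Tensor:
        if self.training:
            # (1) Progressively harder samples: add scheduled Gaussian noise.
            video_feats = video_feats + noise_std(epoch, total_epochs) * torch.randn_like(video_feats)
        h = mish(self.proj(video_feats))
        # (2) Scheduled dropout reduces effective capacity as training progresses.
        return F.dropout(h, p=dropout_p(epoch, total_epochs), training=self.training)


# Example: feats = torch.randn(2, 16, 512); block = CurriculumBlock().train()
# out = block(feats, epoch=5, total_epochs=50)
```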
Related papers
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
- Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks.
These models can produce hallucinations, particularly in domains with incomplete knowledge.
We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z)
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Curriculum-Guided Abstractive Summarization [45.57561926145256]
Recent Transformer-based summarization models have provided a promising approach to abstractive summarization.
These models have two shortcomings: (1) they often perform poorly in content selection, and (2) their training strategy is inefficient, which restricts model performance.
In this paper, we explore two ways to compensate for these pitfalls. First, we augment the Transformer network with a sentence cross-attention module in the decoder, encouraging more abstraction of salient content.
arXiv Detail & Related papers (2023-02-02T11:09:37Z)
- Robustness Analysis of Video-Language Models Against Visual and Language Perturbations [10.862722733649543]
This is the first extensive study of the robustness of video-language models against various real-world perturbations.
We propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations.
arXiv Detail & Related papers (2022-07-05T16:26:05Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-Supervised Representation Learning for Detection of ACL Tear Injury in Knee MR Videos [18.54362818156725]
We propose a self-supervised learning approach that learns transferable features from MR video clips by forcing the model to learn anatomical features.
To the best of our knowledge, none of the supervised learning models that perform injury classification from MR videos provide any explanation for their decisions.
arXiv Detail & Related papers (2020-07-15T15:35:47Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.