DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark
- URL: http://arxiv.org/abs/2406.02468v1
- Date: Tue, 4 Jun 2024 16:38:06 GMT
- Title: DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark
- Authors: Chi-Jui Chang, Oscar Tai-Yuan Chen, Vincent S. Tseng
- Abstract summary: We propose a teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD).
This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference.
In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets.
- Score: 2.941253902145271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition in dark videos is a challenging task for computer vision. Recent research focuses on applying dark enhancement methods to improve the visibility of the video. However, such video processing results in the loss of critical information in the original (un-enhanced) video. Conversely, traditional two-stream methods are capable of learning information from both the original and processed videos, but this leads to a significant increase in computational cost during the inference phase of video classification. To address these challenges, we propose a novel teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD). This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference. Specifically, DL-KDD utilizes the strategy of knowledge distillation during training. The teacher model is trained with enhanced video, and the student model is trained with both the original video and the soft targets generated by the teacher model. This teacher-student framework allows the student model to predict actions using only the original input video during inference. In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets. We achieve the best performance on each dataset, with up to a 4.18% improvement on Dark-48, using only original video inputs, thus avoiding the use of a two-stream framework or enhancement modules at inference. We further validate the effectiveness of the distillation strategy in ablation experiments. The results highlight the advantages of our knowledge distillation framework in dark human action recognition.
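To make the training recipe concrete, here is a minimal PyTorch-style sketch of a teacher-student distillation step of the kind the abstract describes. This is a hedged illustration, not the authors' code: the model classes, the temperature `T`, the weighting `alpha`, and the function name are assumptions, and the soft-target loss shown is the standard temperature-scaled KL formulation from Hinton-style knowledge distillation.

```python
import torch
import torch.nn.functional as F

def dl_kdd_style_step(student, teacher, dark_clip, enhanced_clip, labels,
                      T=4.0, alpha=0.7):
    """One distillation training step (illustrative sketch, not the paper's code).

    teacher: classifier pre-trained on enhanced (brightened) video, kept frozen.
    student: classifier that sees only the original dark video, so inference
             needs neither an enhancement module nor a second stream.
    """
    # Soft targets come from the teacher's view of the enhanced clip.
    with torch.no_grad():
        teacher_logits = teacher(enhanced_clip)

    # The student only ever sees the original (dark) clip.
    student_logits = student(dark_clip)

    # Temperature-scaled KL divergence against the teacher's soft targets.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Ordinary cross-entropy against the ground-truth action labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

At inference time only `student(dark_clip)` is evaluated, which is why such a framework adds no cost over a single-stream model: the teacher and the enhancement step exist only during training.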
Related papers
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- CLearViD: Curriculum Learning for Video Description [3.5293199207536627]
Video description entails automatically generating coherent natural language sentences that narrate the content of a given video.
We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task.
The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
arXiv Detail & Related papers (2023-11-08T06:20:32Z) - Paxion: Patching Action Knowledge in Video-Language Foundation Models [112.92853632161604]
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Despite their impressive performance on various benchmark tasks, recent video-language models show a surprising deficiency (near-random performance) in action knowledge.
We propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective.
arXiv Detail & Related papers (2023-05-18T03:53:59Z)
- Ego-Vehicle Action Recognition based on Semi-Supervised Contrastive Learning [0.0]
We show that proper video-to-video distances can be defined by focusing on ego-vehicle actions.
Existing methods based on supervised learning cannot handle videos that do not fall into predefined classes.
We propose a method based on semi-supervised contrastive learning.
arXiv Detail & Related papers (2023-03-02T05:19:31Z)
- EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Cross-category Video Highlight Detection via Set-based Learning [55.49267044910344]
We propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework.
It learns both the distinctions among target-category videos and the characteristics of highlight moments from the source video category.
It outperforms five typical Unsupervised Domain Adaptation (UDA) algorithms on various cross-category highlight detection tasks.
arXiv Detail & Related papers (2021-08-26T13:06:47Z)
- Self-Supervised Video Representation Learning with Meta-Contrastive Network [10.768575680990415]
We propose a Meta-Contrastive Network (MCN) to enhance the learning ability of self-supervised approaches.
For two downstream tasks, i.e., video action recognition and video retrieval, MCN outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-19T01:21:13Z)
- Privileged Knowledge Distillation for Online Action Detection [114.5213840651675]
Online Action Detection (OAD) in videos is formulated as a per-frame labeling task to address real-time prediction.
This paper presents a novel learning-with-privileged-information framework for online action detection, in which future frames, observable only at training time, are treated as a form of privileged information.
arXiv Detail & Related papers (2020-11-18T08:52:15Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate the framework's effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)