Navigating an Ocean of Video Data: Deep Learning for Humpback Whale
Classification in YouTube Videos
- URL: http://arxiv.org/abs/2212.00822v1
- Date: Thu, 1 Dec 2022 19:19:46 GMT
- Title: Navigating an Ocean of Video Data: Deep Learning for Humpback Whale
Classification in YouTube Videos
- Authors: Michelle Ramirez
- Abstract summary: We use a CNN-RNN architecture pretrained on the ImageNet dataset to classify YouTube videos as relevant or irrelevant.
We achieve an average accuracy of 85.7%, and F1 scores of 84.7% (irrelevant) and 86.6% (relevant), using five-fold cross-validation.
We show that deep learning can be used as a time-efficient step to make social media a viable source of image and video data for biodiversity assessments.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image analysis technologies empowered by artificial intelligence (AI) have
proved images and videos to be an opportune source of data to learn about
humpback whale (Megaptera novaeangliae) population sizes and dynamics. With the
advent of social media, platforms such as YouTube present an abundance of video
data across spatiotemporal contexts documenting humpback whale encounters from
users worldwide. In our work, we focus on automating the classification of
YouTube videos as relevant or irrelevant based on whether they document a true
humpback whale encounter or not via deep learning. We use a CNN-RNN
architecture pretrained on the ImageNet dataset for classification of YouTube
videos as relevant or irrelevant. We achieve an average accuracy of 85.7%, and
F1 scores of 84.7% (irrelevant) and 86.6% (relevant), using five-fold
cross-validation for evaluation on the dataset. We show that deep learning can be used as a
time-efficient step to make social media a viable source of image and video
data for biodiversity assessments.
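Below is a minimal, hypothetical sketch of how such a CNN-RNN relevance classifier and the five-fold evaluation could be set up in PyTorch. The ResNet-50 backbone, GRU aggregator, hidden size, input resolution, and the `videos`/`labels` placeholders are assumptions for illustration, not the paper's exact configuration.
```python
# Hedged sketch: an ImageNet-pretrained CNN extracts per-frame features, a GRU
# aggregates them over time, and a linear head outputs relevant/irrelevant logits.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.model_selection import StratifiedKFold

class CNNRNNClassifier(nn.Module):
    def __init__(self, hidden_size=256, num_classes=2):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        for p in self.cnn.parameters():                            # keep ImageNet weights frozen
            p.requires_grad = False
        self.rnn = nn.GRU(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)            # relevant vs. irrelevant

    def forward(self, frames):                    # frames: (batch, time, 3, 224, 224)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)  # (b, t, 2048)
        _, hidden = self.rnn(feats)               # final hidden state: (1, b, hidden_size)
        return self.head(hidden.squeeze(0))       # logits: (b, num_classes)

def evaluate_five_fold(videos, labels):
    # Five-fold cross-validation skeleton matching the evaluation protocol in the
    # abstract; `videos` and `labels` are hypothetical placeholders for sampled-frame
    # tensors and relevant/irrelevant labels of the YouTube dataset.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(skf.split(videos, labels)):
        model = CNNRNNClassifier()
        # ... train on videos[train_idx], then report accuracy and per-class F1 on videos[test_idx]
```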
Related papers
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
arXiv Detail & Related papers (2024-07-18T01:55:48Z)
- Revisiting Feature Prediction for Learning Visual Representations from Video [62.08833572467379]
V-JEPA is a collection of vision models trained solely using a feature prediction objective.
The models are trained on 2 million videos collected from public datasets.
Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks.
arXiv Detail & Related papers (2024-02-15T18:59:11Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows superior performance in detecting non-matched image-text pairs when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis [60.13902294276283]
We present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated).
Many of the existing deepfake datasets focus exclusively on two types of facial manipulations -- swapping with a different subject's face or altering the existing face.
Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham.
arXiv Detail & Related papers (2022-07-26T17:39:04Z)
- Misinformation Detection on YouTube Using Video Captions [6.503828590815483]
This work proposes an approach that uses state-of-the-art NLP techniques to extract features from video captions (subtitles).
To evaluate our approach, we utilize a publicly accessible and labeled dataset for classifying videos as misinformation or not.
arXiv Detail & Related papers (2021-07-02T10:02:36Z)
- Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping also work well for videos, but that previous implementations could not apply them at a sufficient scale to be effective.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z)
- Creating a Large-scale Synthetic Dataset for Human Activity Recognition [0.8250374560598496]
We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos.
We fine-tune a pre-trained I3D model on our videos, and find that it achieves 73% accuracy on three classes of the HMDB51 dataset.
arXiv Detail & Related papers (2020-07-21T22:20:21Z)
- Ensembles of Deep Neural Networks for Action Recognition in Still Images [3.7900158137749336]
We propose a transfer learning technique to tackle the lack of massive labeled action recognition datasets.
We also use eight different pre-trained CNNs in our framework and investigate their performance on the Stanford 40 dataset.
The best setting of our method achieves 93.17% accuracy on the Stanford 40 dataset.
arXiv Detail & Related papers (2020-03-22T13:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.