Use Your Head: Improving Long-Tail Video Recognition
- URL: http://arxiv.org/abs/2304.01143v1
- Date: Mon, 3 Apr 2023 17:09:47 GMT
- Title: Use Your Head: Improving Long-Tail Video Recognition
- Authors: Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen
- Abstract summary: We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties.
We propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
- Score: 28.506807977493434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
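The abstract describes LMR's reconstruct-and-mix step only at a high level. As a minimal illustrative sketch (not the authors' released implementation, which is linked above), the following PyTorch-style function reconstructs tail-class features as weighted combinations of head-class samples in the batch and mixes the labels in proportion to the reconstruction weights; the function name, the dot-product similarity weighting, and the alpha parameter are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def lmr_style_mix(feats, labels, num_classes, is_tail, alpha=0.5):
    """Illustrative reconstruct-and-mix step (an assumption-based sketch, not the paper's code).

    feats: (B, D) backbone features; labels: (B,) int class indices;
    is_tail: (B,) bool mask marking samples from few-shot/tail classes;
    alpha: interpolation strength between original and reconstructed features.
    """
    one_hot = F.one_hot(labels, num_classes).float()           # (B, C) hard targets
    head_feats = feats[~is_tail]                                # (H, D) head-class samples
    head_labels = one_hot[~is_tail]                             # (H, C)
    if head_feats.numel() == 0 or not is_tail.any():
        return feats, one_hot                                   # nothing to reconstruct

    tail_feats = feats[is_tail]                                 # (T, D)
    # Similarity-based weights over head samples (an assumed choice of weighting).
    sim = tail_feats @ head_feats.t() / feats.shape[1] ** 0.5   # (T, H)
    w = sim.softmax(dim=1)
    recon = w @ head_feats                                      # weighted combination of head samples

    mixed_feats = feats.clone()
    mixed_labels = one_hot.clone()
    mixed_feats[is_tail] = alpha * tail_feats + (1 - alpha) * recon
    # Label mixing: soft targets blend the tail label with the contributing head labels.
    mixed_labels[is_tail] = alpha * one_hot[is_tail] + (1 - alpha) * (w @ head_labels)
    return mixed_feats, mixed_labels
```

Training would then apply a soft-target cross-entropy loss to the mixed features and labels; the exact weighting and schedule used by the paper should be taken from the released code.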
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods.
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
- VideoLT: Large-scale Long-tailed Video Recognition [100.15503884988736]
We introduce VideoLT, a large-scale long-tailed video recognition dataset.
Our VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution.
We propose FrameStack, a simple yet effective method for long-tailed video recognition task.
arXiv Detail & Related papers (2021-05-06T13:47:44Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- ResLT: Residual Learning for Long-tailed Recognition [64.19728932445523]
We propose a more fundamental perspective for long-tailed recognition, i.e., from the aspect of parameter space.
We design an effective residual fusion mechanism: one main branch is optimized to recognize images from all classes, while two residual branches are gradually fused and optimized to enhance images from the medium+tail classes and the tail classes respectively (a rough sketch follows this entry).
We test our method on several benchmarks, i.e., long-tailed versions of CIFAR-10, CIFAR-100, Places, ImageNet, and iNaturalist 2018.
arXiv Detail & Related papers (2021-01-26T08:43:50Z)
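As a rough sketch under our own assumptions (the class groupings, masking scheme, and layer choices below are not taken from the ResLT paper), a residual-fusion head of the kind summarized above could be wired up as follows:

```python
import torch
import torch.nn as nn

class ResidualFusionHead(nn.Module):
    """Illustrative residual-fusion classifier head (an assumption-based sketch)."""

    def __init__(self, feat_dim, num_classes, medium_tail_idx, tail_idx):
        super().__init__()
        self.main = nn.Linear(feat_dim, num_classes)          # main branch: all classes
        self.medium_tail = nn.Linear(feat_dim, num_classes)   # residual branch for medium+tail classes
        self.tail = nn.Linear(feat_dim, num_classes)          # residual branch for tail classes only
        # Masks restrict each residual branch to the classes it is meant to enhance.
        mt_mask = torch.zeros(num_classes)
        mt_mask[medium_tail_idx] = 1.0
        t_mask = torch.zeros(num_classes)
        t_mask[tail_idx] = 1.0
        self.register_buffer("mt_mask", mt_mask)
        self.register_buffer("t_mask", t_mask)

    def forward(self, x):
        logits = self.main(x)
        logits = logits + self.medium_tail(x) * self.mt_mask  # fuse medium+tail residual
        logits = logits + self.tail(x) * self.t_mask          # fuse tail residual
        return logits
```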
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
- Exploring Long Tail Visual Relationship Recognition with Large Vocabulary [40.51076584921913]
We conduct the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR).
LTVRR aims at improving the learning of structured visual relationships that come from the long-tail.
We introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets.
arXiv Detail & Related papers (2020-03-25T19:03:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.