Use Your Head: Improving Long-Tail Video Recognition
- URL: http://arxiv.org/abs/2304.01143v1
- Date: Mon, 3 Apr 2023 17:09:47 GMT
- Title: Use Your Head: Improving Long-Tail Video Recognition
- Authors: Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen
- Abstract summary: We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties.
We propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
- Score: 28.506807977493434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
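
To make the reconstruction-plus-label-mixing idea concrete, here is a minimal sketch assuming per-sample feature vectors. The similarity-based weighting, the alpha blend, and all names are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of the LMR idea: re-express a few-shot (tail) feature as
# a weighted combination of head-class features, and mix labels to match.
import torch
import torch.nn.functional as F

def lmr_reconstruct(tail_feat, tail_label, head_feats, head_labels,
                    num_classes, alpha=0.5):
    """tail_feat: (D,); tail_label: 0-dim long tensor; head_feats: (N, D);
    head_labels: (N,) long tensor; alpha: share of the original tail sample."""
    # Reconstruction weights: softmax over similarity to each head sample
    weights = F.softmax(head_feats @ tail_feat, dim=0)            # (N,)
    recon = weights @ head_feats                                  # (D,)

    # Mixed feature keeps alpha of the original few-shot sample
    mixed_feat = alpha * tail_feat + (1 - alpha) * recon

    # Label mixing: alpha mass on the tail class, the remainder spread over
    # head classes in proportion to their reconstruction weights
    head_onehot = F.one_hot(head_labels, num_classes).float()     # (N, C)
    mixed_label = (alpha * F.one_hot(tail_label, num_classes).float()
                   + (1 - alpha) * (weights @ head_onehot))       # (C,)
    return mixed_feat, mixed_label

# Example: 8-dim features, 10 classes, head samples drawn from classes 0-4
mf, ml = lmr_reconstruct(torch.randn(8), torch.tensor(9),
                         torch.randn(32, 8), torch.randint(0, 5, (32,)),
                         num_classes=10)
```

Training on (mixed_feat, mixed_label) pairs discourages the classifier from memorising the handful of tail instances, which is the overfitting failure mode the abstract describes.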
Related papers
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
We introduce a training-free method, ReTaKe, to reduce both temporal visual redundancy and knowledge redundancy for long video understanding.
DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception.
PivotKV employs the selected keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores.
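
A hedged sketch of the peak-distance intuition behind DPSelect as summarised above; the L2 distance, the local-maximum test, and the min_gap parameter are assumptions for illustration, not ReTaKe's actual procedure.

```python
# Illustrative keyframe selection: keep frames where the feature distance to
# the previous frame is a local maximum, i.e. a peak in visual change.
import numpy as np

def select_keyframes(frame_feats: np.ndarray, min_gap: int = 1) -> list[int]:
    """frame_feats: (T, D) per-frame features; returns peak-change frame indices."""
    dists = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)  # (T-1,)
    keyframes = []
    for t in range(1, len(dists) - 1):
        # dists[t] measures the change between frames t and t+1
        if dists[t] > dists[t - 1] and dists[t] > dists[t + 1]:
            if not keyframes or (t + 1) - keyframes[-1] >= min_gap:
                keyframes.append(t + 1)
    return keyframes
```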
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme boosts long video understanding performance without training on long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- VideoLT: Large-scale Long-tailed Video Recognition [100.15503884988736]
We introduce VideoLT, a large-scale long-tailed video recognition dataset.
Our VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution.
We propose FrameStack, a simple yet effective method for the long-tailed video recognition task.
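
As a rough illustration of frame-level re-sampling for class balance (a sketch under stated assumptions, not the authors' FrameStack implementation): below, a running per-class score decides how many frames each of two videos contributes to a mixed clip, and the label is softened in proportion.

```python
# Hypothetical frame-level mixing: under-performing (typically tail) classes
# contribute more frames to the mixed clip.
import numpy as np

def mix_clips(frames_a, label_a, frames_b, label_b, class_scores, total_frames=16):
    """frames_*: (T, H, W, C) clips; class_scores: per-class accuracy/AP in [0, 1]."""
    need_a, need_b = 1 - class_scores[label_a], 1 - class_scores[label_b]
    n_a = int(round(total_frames * need_a / max(need_a + need_b, 1e-8)))
    n_b = total_frames - n_a

    # Uniformly sample each video's share of frames, then stack along time
    idx_a = np.linspace(0, len(frames_a) - 1, n_a).astype(int)
    idx_b = np.linspace(0, len(frames_b) - 1, n_b).astype(int)
    mixed_clip = np.concatenate([frames_a[idx_a], frames_b[idx_b]], axis=0)

    # Soft label proportional to each video's frame share
    mixed_label = {label_a: n_a / total_frames, label_b: n_b / total_frames}
    return mixed_clip, mixed_label
```

The actual FrameStack determines its sampling ratio dynamically from network feedback during training; the fixed formula above only conveys the shape of the idea.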
arXiv Detail & Related papers (2021-05-06T13:47:44Z)
- ResLT: Residual Learning for Long-tailed Recognition [64.19728932445523]
We propose a more fundamental perspective for long-tailed recognition, i.e., from the aspect of parameter space.
We design an effective residual fusion mechanism: one main branch is optimized to recognize images from all classes, while two residual branches are gradually fused and optimized to enhance images from the medium+tail classes and the tail classes, respectively.
We test our method on several benchmarks, i.e., long-tailed versions of CIFAR-10, CIFAR-100, Places, ImageNet, and iNaturalist 2018.
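
A minimal sketch of the residual-branch fusion described above, under stated assumptions: the linear heads, additive fusion, and boolean class masks are illustrative stand-ins for the paper's architecture.

```python
# Illustrative ResLT-style head: a main branch covers all classes, and two
# residual branches add logit corrections for medium+tail and tail classes.
import torch
import torch.nn as nn

class ResidualLongTailHead(nn.Module):
    def __init__(self, feat_dim, num_classes, medium_tail_mask, tail_mask):
        super().__init__()
        self.main = nn.Linear(feat_dim, num_classes)        # all classes
        self.res_medium = nn.Linear(feat_dim, num_classes)  # medium + tail
        self.res_tail = nn.Linear(feat_dim, num_classes)    # tail only
        # (num_classes,) boolean masks selecting the classes each branch boosts
        self.register_buffer("mt_mask", medium_tail_mask.float())
        self.register_buffer("t_mask", tail_mask.float())

    def forward(self, feats):
        logits = self.main(feats)
        logits = logits + self.res_medium(feats) * self.mt_mask
        logits = logits + self.res_tail(feats) * self.t_mask
        return logits
```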
arXiv Detail & Related papers (2021-01-26T08:43:50Z)
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
- Exploring Long Tail Visual Relationship Recognition with Large Vocabulary [40.51076584921913]
We present the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR).
LTVRR aims at improving the learning of structured visual relationships that come from the long tail.
We introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets.
arXiv Detail & Related papers (2020-03-25T19:03:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.