Use Your Head: Improving Long-Tail Video Recognition
- URL: http://arxiv.org/abs/2304.01143v1
- Date: Mon, 3 Apr 2023 17:09:47 GMT
- Title: Use Your Head: Improving Long-Tail Video Recognition
- Authors: Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen
- Abstract summary: We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties.
We propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
- Score: 28.506807977493434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
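
To make the reconstruction-plus-label-mixing idea concrete, here is a minimal sketch assuming per-sample feature vectors. The similarity-based weighting, the alpha blend, and all names are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of the LMR idea: re-express a few-shot (tail) feature as
# a weighted combination of head-class features, and mix labels to match.
import torch
import torch.nn.functional as F

def lmr_reconstruct(tail_feat, tail_label, head_feats, head_labels,
                    num_classes, alpha=0.5):
    """tail_feat: (D,); tail_label: 0-dim long tensor; head_feats: (N, D);
    head_labels: (N,) long tensor; alpha: share of the original tail sample."""
    # Reconstruction weights: softmax over similarity to each head sample
    weights = F.softmax(head_feats @ tail_feat, dim=0)            # (N,)
    recon = weights @ head_feats                                  # (D,)

    # Mixed feature keeps alpha of the original few-shot sample
    mixed_feat = alpha * tail_feat + (1 - alpha) * recon

    # Label mixing: alpha mass on the tail class, the remainder spread over
    # head classes in proportion to their reconstruction weights
    head_onehot = F.one_hot(head_labels, num_classes).float()     # (N, C)
    mixed_label = (alpha * F.one_hot(tail_label, num_classes).float()
                   + (1 - alpha) * (weights @ head_onehot))       # (C,)
    return mixed_feat, mixed_label

# Example: 8-dim features, 10 classes, head samples drawn from classes 0-4
mf, ml = lmr_reconstruct(torch.randn(8), torch.tensor(9),
                         torch.randn(32, 8), torch.randint(0, 5, (32,)),
                         num_classes=10)
```

Training on (mixed_feat, mixed_label) pairs discourages the classifier from memorising the handful of tail instances, which is the overfitting failure mode the abstract describes.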
Related papers
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
We introduce a training-free method, ReTaKe, to reduce both temporal visual redundancy and knowledge redundancy for long video understanding.
DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception.
PivotKV employs the selected keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores.
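
A hedged sketch of the peak-distance intuition behind DPSelect as summarised above; the L2 distance, the local-maximum test, and the min_gap parameter are assumptions for illustration, not ReTaKe's actual procedure.

```python
# Illustrative keyframe selection: keep frames where the feature distance to
# the previous frame is a local maximum, i.e. a peak in visual change.
import numpy as np

def select_keyframes(frame_feats: np.ndarray, min_gap: int = 1) -> list[int]:
    """frame_feats: (T, D) per-frame features; returns peak-change frame indices."""
    dists = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)  # (T-1,)
    keyframes = []
    for t in range(1, len(dists) - 1):
        # dists[t] measures the change between frames t and t+1
        if dists[t] > dists[t - 1] and dists[t] > dists[t + 1]:
            if not keyframes or (t + 1) - keyframes[-1] >= min_gap:
                keyframes.append(t + 1)
    return keyframes
```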
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme boosts long video understanding performance without training on long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- VideoLT: Large-scale Long-tailed Video Recognition [100.15503884988736]
We introduce VideoLT, a large-scale long-tailed video recognition dataset.
Our VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution.
We propose FrameStack, a simple yet effective method for the long-tailed video recognition task.
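
As a rough illustration of frame-level re-sampling for class balance (a sketch under stated assumptions, not the authors' FrameStack implementation): below, a running per-class score decides how many frames each of two videos contributes to a mixed clip, and the label is softened in proportion.

```python
# Hypothetical frame-level mixing: under-performing (typically tail) classes
# contribute more frames to the mixed clip.
import numpy as np

def mix_clips(frames_a, label_a, frames_b, label_b, class_scores, total_frames=16):
    """frames_*: (T, H, W, C) clips; class_scores: per-class accuracy/AP in [0, 1]."""
    need_a, need_b = 1 - class_scores[label_a], 1 - class_scores[label_b]
    n_a = int(round(total_frames * need_a / max(need_a + need_b, 1e-8)))
    n_b = total_frames - n_a

    # Uniformly sample each video's share of frames, then stack along time
    idx_a = np.linspace(0, len(frames_a) - 1, n_a).astype(int)
    idx_b = np.linspace(0, len(frames_b) - 1, n_b).astype(int)
    mixed_clip = np.concatenate([frames_a[idx_a], frames_b[idx_b]], axis=0)

    # Soft label proportional to each video's frame share
    mixed_label = {label_a: n_a / total_frames, label_b: n_b / total_frames}
    return mixed_clip, mixed_label
```

The actual FrameStack determines its sampling ratio dynamically from network feedback during training; the fixed formula above only conveys the shape of the idea.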
arXiv Detail & Related papers (2021-05-06T13:47:44Z)
- ResLT: Residual Learning for Long-tailed Recognition [64.19728932445523]
We propose a more fundamental perspective for long-tailed recognition, i.e., from the aspect of parameter space.
We design an effective residual fusion mechanism: one main branch is optimized to recognize images from all classes, while two residual branches are gradually fused and optimized to enhance images from the medium+tail classes and the tail classes, respectively.
We test our method on several benchmarks, i.e., long-tailed versions of CIFAR-10, CIFAR-100, Places, ImageNet, and iNaturalist 2018.
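
A minimal sketch of the residual-branch fusion described above, under stated assumptions: the linear heads, additive fusion, and boolean class masks are illustrative stand-ins for the paper's architecture.

```python
# Illustrative ResLT-style head: a main branch covers all classes, and two
# residual branches add logit corrections for medium+tail and tail classes.
import torch
import torch.nn as nn

class ResidualLongTailHead(nn.Module):
    def __init__(self, feat_dim, num_classes, medium_tail_mask, tail_mask):
        super().__init__()
        self.main = nn.Linear(feat_dim, num_classes)        # all classes
        self.res_medium = nn.Linear(feat_dim, num_classes)  # medium + tail
        self.res_tail = nn.Linear(feat_dim, num_classes)    # tail only
        # (num_classes,) boolean masks selecting the classes each branch boosts
        self.register_buffer("mt_mask", medium_tail_mask.float())
        self.register_buffer("t_mask", tail_mask.float())

    def forward(self, feats):
        logits = self.main(feats)
        logits = logits + self.res_medium(feats) * self.mt_mask
        logits = logits + self.res_tail(feats) * self.t_mask
        return logits
```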
arXiv Detail & Related papers (2021-01-26T08:43:50Z)
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
- Exploring Long Tail Visual Relationship Recognition with Large Vocabulary [40.51076584921913]
We present the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR).
LTVRR aims at improving the learning of structured visual relationships that come from the long tail.
We introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets.
arXiv Detail & Related papers (2020-03-25T19:03:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.