Open-World Object Counting in Videos
- URL: http://arxiv.org/abs/2506.15368v1
- Date: Wed, 18 Jun 2025 11:35:30 GMT
- Title: Open-World Object Counting in Videos
- Authors: Niki Amini-Naieni, Andrew Zisserman
- Abstract summary: We introduce a new task of open-world object counting in videos. The objective is to enumerate all the unique instances of the target objects in the video. We introduce a model, CountVid, for this task.
- Score: 55.2480439325792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
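As a rough illustration of the counting-by-tracking idea described in the abstract, the toy sketch below links per-frame detections (standing in for an image-based counter) across frames with a simple IoU matcher (standing in for a promptable video segmentation and tracking model) and reports the number of unique track IDs, so an object that persists or reappears is counted only once. The boxes, matching rule, and function names are illustrative assumptions, not the released CountVid implementation; see the linked repository for the actual code.

```python
# Toy sketch: count unique objects across frames by linking detections into
# tracks, so a reappearing object is not counted twice. The hard-coded boxes
# and greedy IoU matcher are illustrative stand-ins, not the CountVid code.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_unique(frames_detections, iou_thresh=0.5):
    """frames_detections: list of per-frame lists of (x1, y1, x2, y2) boxes."""
    tracks = {}   # track_id -> most recent box for that object
    next_id = 0
    for boxes in frames_detections:
        for box in boxes:
            # Match the detection to an existing track if it overlaps enough;
            # otherwise it is a new unique object and starts a new track.
            best_id, best_iou = None, iou_thresh
            for tid, last_box in tracks.items():
                overlap = iou(box, last_box)
                if overlap >= best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                best_id = next_id
                next_id += 1
            tracks[best_id] = box
    return next_id  # number of distinct objects ever tracked

# Two frames showing the same two objects shifted slightly: the count is 2, not 4.
frame1 = [(0, 0, 10, 10), (20, 0, 30, 10)]
frame2 = [(1, 0, 11, 10), (21, 0, 31, 10)]
print(count_unique([frame1, frame2]))  # -> 2
```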
Related papers
- Object-centric Video Question Answering with Visual Grounding and Referring [43.963739052764595]
We introduce a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks. We also propose STOM, a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation.
arXiv Detail & Related papers (2025-07-25T18:11:23Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video [27.329015161325962]
Indiscernible object counting, which aims to count targets that blend into their surroundings, remains a challenge.
We propose a large-scale dataset called YoutubeFish-35, which contains a total of 35 sequences of high-definition videos.
We propose TransVidCount, a new strong baseline that combines density and regression branches along the temporal domain in a unified framework.
arXiv Detail & Related papers (2024-03-06T04:54:00Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Uncertainty Aware Active Learning for Reconfiguration of Pre-trained Deep Object-Detection Networks for New Target Domains [0.0]
Object detection is one of the most fundamental computer vision tasks.
To obtain training data for object detection models efficiently, many datasets collect their unannotated data in video format.
Annotating every frame from a video is costly and inefficient since many frames contain very similar information for the model to learn from.
In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem.
arXiv Detail & Related papers (2023-03-22T17:14:10Z) - VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z) - VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background (see the sketch after this list).
Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
arXiv Detail & Related papers (2021-01-16T23:07:48Z)
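To make the correlation-volume assignment mentioned in the VideoClick summary above concrete, here is a minimal sketch under simplified assumptions: each target-frame pixel feature is compared against one feature vector per reference-frame object plus a background feature, and the pixel takes the label of its best match. The feature shapes and the cosine-similarity choice are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of pixel-to-object assignment via a correlation volume:
# every target-frame pixel is scored against each reference-frame object
# feature (plus a background feature) and labelled by its best match.
# Shapes and the cosine-similarity scoring are illustrative assumptions.

import numpy as np

def assign_pixels(target_feats, object_feats, background_feat):
    """
    target_feats:    (H, W, C) per-pixel features of the target frame
    object_feats:    (K, C) one feature vector per reference-frame object
    background_feat: (C,) feature representing background
    returns:         (H, W) labels in {0 = background, 1..K = object index}
    """
    H, W, C = target_feats.shape
    # Stack background first so label 0 means "background".
    refs = np.vstack([background_feat[None, :], object_feats])            # (K+1, C)
    # Normalise so the dot product acts as a cosine-similarity "correlation".
    t = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + 1e-8)
    r = refs / (np.linalg.norm(refs, axis=-1, keepdims=True) + 1e-8)
    # Correlation volume: similarity of every pixel to every reference entry.
    corr = t.reshape(H * W, C) @ r.T                                       # (H*W, K+1)
    return corr.argmax(axis=1).reshape(H, W)

# Tiny example: a 2x2 feature map, two object prototypes, and a background prototype.
target = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[0.7, 0.7], [0.0, 0.0]]])
objects = np.array([[1.0, 0.0],   # object 1
                    [0.0, 1.0]])  # object 2
background = np.array([0.1, 0.1])
print(assign_pixels(target, objects, background))
```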