Improving Token-based Object Detection with Video
- URL: http://arxiv.org/abs/2506.22562v1
- Date: Fri, 27 Jun 2025 18:21:58 GMT
- Title: Improving Token-based Object Detection with Video
- Authors: Abhineet Singh, Nilanjan Ray
- Abstract summary: This paper improves upon the Pix2Seq object detector by extending it for videos. It represents objects as variable-length sequences of discrete tokens. It outputs the video objects as fully integrated and indivisible 3D boxes or tracklets.
- Score: 4.995343972237369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
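The abstract's core idea is representing each video object as a variable-length sequence of discrete tokens, with a tracklet serialized as a single indivisible unit rather than per-frame 2D boxes linked afterwards. A minimal sketch of how such Pix2Seq-style coordinate quantization might extend to tracklets is below; the bin count, token layout, and class-token offset are illustrative assumptions, not the paper's exact scheme.

```python
def quantize(coord, num_bins=500):
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    return min(int(coord * num_bins), num_bins - 1)

def tracklet_to_tokens(boxes, class_id, num_bins=500):
    """Serialize a tracklet (one normalized [x1, y1, x2, y2] box per frame)
    into one flat token sequence, ending with a class token.

    Class tokens are assumed to occupy IDs above the coordinate bins,
    so the whole tracklet is a single variable-length sequence."""
    tokens = []
    for box in boxes:
        tokens.extend(quantize(c, num_bins) for c in box)
    tokens.append(num_bins + class_id)
    return tokens

# A two-frame tracklet yields 2 * 4 coordinate tokens plus one class token;
# longer subsequences simply produce longer sequences, with no box sampling
# or anchor heuristics involved.
tracklet = [[0.1, 0.2, 0.4, 0.5], [0.12, 0.22, 0.42, 0.52]]
tokens = tracklet_to_tokens(tracklet, class_id=3)
```

Because the sequence length grows linearly with the number of frames, this representation scales with the input subsequence length exactly as the abstract describes, up to whole-video multi-object tracking.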
Related papers
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z) - Follow Anything: Open-set detection, tracking, and following in real-time [89.83421771766682]
We present a robotic system to detect, track, and follow any object in real-time.
Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model.
FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second.
arXiv Detail & Related papers (2023-08-10T17:57:06Z) - Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z) - Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Recent Trends in 2D Object Detection and Applications in Video Event Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enable human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z) - Deep Video Inpainting Detection [95.36819088529622]
Video inpainting detection localizes an inpainted region in a video both spatially and temporally.
VIDNet, the Video Inpainting Detection Network, contains a two-stream encoder-decoder architecture with an attention module.
arXiv Detail & Related papers (2021-01-26T20:53:49Z) - Robust and efficient post-processing for video object detection [9.669942356088377]
This work introduces a novel post-processing pipeline that overcomes some of the limitations of previous post-processing methods.
Our method improves the results of state-of-the-art video-specific detectors, especially regarding fast-moving objects.
When applied to efficient still-image detectors, such as YOLO, it provides results comparable to much more computationally intensive detectors.
arXiv Detail & Related papers (2020-09-23T10:47:24Z) - Pack and Detect: Fast Object Detection in Videos Using Region-of-Interest Packing [15.162117090697006]
We propose Pack and Detect, an approach to reduce the computational requirements of object detection in videos.
Experiments using the ImageNet video object detection dataset indicate that PaD can potentially reduce the number of FLOPS required for a frame by $4\times$.
arXiv Detail & Related papers (2018-09-05T19:29:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.