Related papers: Context Enhanced Transformer for Single Image Object Detection

Context Enhanced Transformer for Single Image Object Detection

URL: http://arxiv.org/abs/2312.14492v2
Date: Tue, 26 Dec 2023 05:54:22 GMT
Title: Context Enhanced Transformer for Single Image Object Detection
Authors: Seungjun An, Seonghoon Park, Gyeongnyeon Kim, Jeongyeol Baek, Byeongwon Lee, Seungryong Kim
Abstract summary: We propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR) To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. We present a classification-based sampling technique to selectively utilize the relevant memory for the current image.
Score: 31.52466523847246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR), by incorporating temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. In the testing, We introduce a test-time memory adaptation method that updates individual memory functions by considering the test distribution. Experiments with CityCam and ImageNet VID datasets exhibit the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.

Related papers

Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) needs to concern the high across-frame variation in object appearance, and the diverse deterioration in some frames. Most of contemporary aggregation methods are tailored for two-stage detectors, suffering from high computational costs. This study invents a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data. Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query. We show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner. Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation. Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data. Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z)
ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations. We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings. We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z)
Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework. It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
Recent Trends in 2D Object Detection and Applications in Video Event Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning. We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z)
Robust and efficient post-processing for video object detection [9.669942356088377]
This work introduces a novel post-processing pipeline that overcomes some of the limitations of previous post-processing methods. Our method improves the results of state-of-the-art specific video detectors, specially regarding fast moving objects. And applied to efficient still image detectors, such as YOLO, provides comparable results to much more computationally intensive detectors.
arXiv Detail & Related papers (2020-09-23T10:47:24Z)
MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame. Inter- and intra-frames are the key sources for exploiting temporal and spatial information. We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.